
Dealing with very large datasets & just in time loading

I have a .NET application written in C# (.NET 4.0). In this application, we have to read a large dataset from a file and display the contents in a grid-like structure. So, to accomplish this, I placed a DataGridView on the form. It has 3 columns, and all column data comes from the file. Initially, the file had about 600,000 records, corresponding to 600,000 rows in the DataGridView.

I quickly found out that the DataGridView collapses with such a large dataset, so I had to switch to virtual mode. To accomplish this, I first read the file completely into 3 different arrays (corresponding to the 3 columns), and then, when the CellValueNeeded event fires, I supply the correct values from the arrays.
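For reference, a rough sketch of what that virtual-mode setup looks like (the column array names and the SetData helper are placeholders for illustration, not our actual code):

```csharp
using System;
using System.Windows.Forms;

public class RecordForm : Form
{
    private readonly DataGridView grid = new DataGridView();
    private string[] col1, col2, col3;   // the three per-column arrays, filled elsewhere

    public RecordForm()
    {
        grid.Dock = DockStyle.Fill;
        grid.VirtualMode = true;                    // grid no longer stores cell values itself
        grid.ColumnCount = 3;
        grid.CellValueNeeded += OnCellValueNeeded;  // raised for each cell as it becomes visible
        Controls.Add(grid);
    }

    public void SetData(string[] a, string[] b, string[] c)
    {
        col1 = a; col2 = b; col3 = c;
        grid.RowCount = col1.Length;                // tell the grid how many virtual rows exist
    }

    private void OnCellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
    {
        switch (e.ColumnIndex)
        {
            case 0: e.Value = col1[e.RowIndex]; break;
            case 1: e.Value = col2[e.RowIndex]; break;
            case 2: e.Value = col3[e.RowIndex]; break;
        }
    }
}
```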

However, as we quickly found out, there can be a huge (HUGE!) number of records in this file. When the number of records is very large, reading all the data into an array or a List<>, etc. appears not to be feasible; we quickly run into memory allocation errors (OutOfMemoryException).

We got stuck there, but then realized: why read all the data into arrays first? Why not read the file on demand as the CellValueNeeded event fires? So that's what we do now: we open the file but do not read anything, and as CellValueNeeded events fire, we first Seek() to the correct position in the file and then read the corresponding data.
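In simplified form, the on-demand reader looks something like the sketch below. It assumes fixed-length records so a row's byte offset can be computed directly; the RecordSize constant and the '|' field separator are purely illustrative assumptions:

```csharp
using System;
using System.IO;
using System.Text;

// Sketch of the seek-per-request idea, assuming fixed-length records. With
// variable-length lines you would instead need an index of line offsets built
// in one initial pass over the file.
public class RecordFile : IDisposable
{
    private const int RecordSize = 64;          // assumed fixed record length in bytes
    private readonly FileStream stream;
    private readonly byte[] buffer = new byte[RecordSize];

    public RecordFile(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    }

    public long RecordCount
    {
        get { return stream.Length / RecordSize; }
    }

    public string ReadField(long rowIndex, int columnIndex)
    {
        stream.Seek(rowIndex * RecordSize, SeekOrigin.Begin);   // jump straight to the record
        int read = stream.Read(buffer, 0, RecordSize);          // read only that record

        string record = Encoding.ASCII.GetString(buffer, 0, read);
        return record.Split('|')[columnIndex];                  // pick out the requested column
    }

    public void Dispose()
    {
        stream.Dispose();
    }
}
```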

This is the best we could come up with, but first of all it is quite slow, which makes the application sluggish and not user-friendly. Second, we can't help but think that there must be a better way to accomplish this. For example, some binary editors (like HXD) are blindingly fast for any file size, so I'd like to know how this can be achieved.

Oh, and to add to our problems: in the DataGridView's virtual mode, when we set the RowCount to the number of rows available in the file (say 16,000,000), it takes a while for the DataGridView even to initialize itself. Any comments on this 'problem' would be appreciated as well.

Thanks

SomethingBetter asked Jan 26 '11 16:01




1 Answer

If you can't fit your entire data set in memory, then you need a buffering scheme. Rather than reading just the amount of data needed to fill the DataGridView in response to CellValueNeeded, your application should anticipate the user's actions and read ahead. So, for example, when the program first starts up, it should read the first 10,000 records (or maybe only 1,000 or perhaps 100,000--whatever is reasonable in your case). Then, CellValueNeeded requests can be filled immediately from memory.
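A minimal sketch of such a buffer, assuming the on-demand RecordFile reader sketched in the question (ChunkSize is just a number you would tune):

```csharp
using System;

// Keep a window of rows in memory and refill it, centred on the requested row,
// whenever a request falls outside the window.
public class RowCache
{
    private const int ChunkSize = 10000;
    private readonly RecordFile file;
    private string[][] window = new string[0][];   // window[i] holds one cached row (3 fields)
    private long windowStart;

    public RowCache(RecordFile file)
    {
        this.file = file;
    }

    // Called from the CellValueNeeded handler; almost always served from memory.
    public string GetValue(long row, int column)
    {
        if (row < windowStart || row >= windowStart + window.Length)
            FillWindow(row);
        return window[row - windowStart][column];
    }

    private void FillWindow(long centre)
    {
        windowStart = Math.Max(0, centre - ChunkSize / 2);
        long count = Math.Min(ChunkSize, file.RecordCount - windowStart);

        window = new string[count][];
        for (long i = 0; i < count; i++)
        {
            window[i] = new[]
            {
                file.ReadField(windowStart + i, 0),
                file.ReadField(windowStart + i, 1),
                file.ReadField(windowStart + i, 2)
            };
        }
    }
}
```

Your CellValueNeeded handler then calls GetValue, which only touches the disk when a request falls outside the cached window.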

As the user moves through the grid, your program stays, as much as possible, one step ahead of the user. There might be short pauses if the user jumps ahead of you (say, wants to jump from the front to the end) and you have to go out to disk in order to fulfill the request.

That buffering is usually best accomplished by a separate thread, although synchronization can sometimes be an issue if the thread is reading ahead in anticipation of the user's next action, and then the user does something completely unexpected like jump to the start of the list.
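As a sketch only, here is one way the off-thread read-ahead could be structured; the loadRows delegate is a stand-in for whatever actually reads the file, and the generation counter is one simple way to handle the synchronization issue just mentioned:

```csharp
using System;
using System.Threading.Tasks;

// The UI thread asks for a chunk, the file reading runs on a worker Task, and a
// generation counter guarded by a lock discards a chunk that was loaded for an
// old position if the user has since jumped somewhere else.
public class Prefetcher
{
    private readonly Func<long, int, string[][]> loadRows;  // (startRow, count) -> rows
    private readonly object sync = new object();
    private long generation;

    public Prefetcher(Func<long, int, string[][]> loadRows)
    {
        this.loadRows = loadRows;
    }

    public string[][] CurrentChunk { get; private set; }
    public long ChunkStart { get; private set; }

    // Call from the UI thread whenever the visible position changes.
    public void RequestChunk(long startRow, int count)
    {
        long myGeneration;
        lock (sync)
        {
            myGeneration = ++generation;
        }

        Task.Factory.StartNew(() =>             // .NET 4.0 idiom; Task.Run on 4.5+
        {
            string[][] rows = loadRows(startRow, count);   // slow disk work off the UI thread
            lock (sync)
            {
                // Publish only if no newer request superseded this one.
                if (myGeneration == generation)
                {
                    CurrentChunk = rows;
                    ChunkStart = startRow;
                }
            }
        });
    }
}
```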

16 million records isn't really all that many to keep in memory, unless the records are very large or you don't have much memory on your server. Certainly, 16 million is nowhere near the maximum size of a List<T>, unless T is a value type (structure). How many gigabytes of data are you talking about here?
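As a rough back-of-envelope, assuming three string fields averaging about 20 characters each (an assumption, since the actual field sizes aren't stated): three arrays of 16,000,000 references cost roughly 16,000,000 × 3 × 8 bytes ≈ 384 MB on a 64-bit process, and the string objects themselves add on the order of 16,000,000 × 3 × (40 bytes of character data + ~25 bytes of per-string overhead) ≈ 3 GB. That is why the record size, not just the record count, decides whether keeping everything in memory is an option.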

Jim Mischel answered Sep 17 '22 19:09