
Efficient Way to Process Simple but Large Files in C++

Tags:

c++

I'm working on a project that has me a bit over my head performance-wise. I'm tasked with reading large (50MB or so) files of particle coordinates and displaying them. I'd like to use C++ for this because I am learning it already.

The coordinate structure in the files is simple; there are just a lot of them (say a million or so):

1234.5667 5234.1566 //coordinate 1  
8532.6123 5152.6612 //coordinate 2  
....

Being a noob, I just want to read the files in line by line and store the values in vectors. Is this wrong? Maybe I should read the whole file in first (buffered?) and then parse the values?

Working example:

#include <ctime>
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    clock_t c1 = clock();
    vector<double> coords;
    double coord;
    ifstream fin("file.txt");
    while(fin >> coord) {
        coords.push_back(coord);
    }
    cout << "done. " << coords.size()/2 << " coords read.\n";
    cout << "took " << (clock() - c1)/(double)CLOCKS_PER_SEC << " seconds." << endl;
    return 0;
}

And the corresponding output on a 40MB file with 2 million coordinates:

done. 2000000 coords read.
took 1.74 seconds.

This seems fast to me, but I'm thinking my mind isn't a good judge.

asked Jan 16 '10 by swajak


3 Answers

You might want to preallocate the vector using .reserve if you have an idea of how large the "average" file is.
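
For example, a minimal sketch, assuming a "typical" file looks like the one in the question (about 2 million coordinate pairs, i.e. roughly 4 million values):

#include <fstream>
#include <vector>
using namespace std;

int main() {
    vector<double> coords;
    // Assumption: a typical file holds ~4 million values (2 million pairs),
    // so reserve roughly that much up front to avoid repeated reallocations
    // while push_back grows the vector.
    coords.reserve(4000000);

    ifstream fin("file.txt");
    double coord;
    while (fin >> coord)
        coords.push_back(coord);
    return 0;
}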

Efficiency is a tricky game. Don't play tricks early on; design a good basic algorithm first. If it's not fast enough, then start looking at the I/O routines, and at whether you're creating any "extra" objects (explicitly or implicitly, especially if you're passing parameters around).

In your example, you might want to make the second call to clock() before printing the summary output -- you'll get a slightly more accurate timing! :)

answered Nov 19 '22 by Joe


If your input file has a fixed line width (chars per line), then hauling data into a buffer becomes easier. For files with variable line widths, determining when to refill the buffer becomes tricky.

The Single Buffer Method

Allocate a buffer of char, with size = (line width) * (number of lines to store). Or work backwards: max lines in buffer = (buffer size) / (line width). Remember that the line width may include '\n', '\r', or both.

Read in an entire buffer's worth using fread or istream::read, otherwise known as block reading. Maintain a pointer to the "next line" in the buffer. Increment the pointer after reading a line; if it goes beyond the end of the buffer, read in another block. Repeat as necessary.
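
A minimal sketch of the idea, assuming (as in the question) the file contains only whitespace-separated numbers; to keep it short, the whole ~50MB file is pulled in with one block read rather than refilled chunk by chunk:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    // One block read of the entire file into a char buffer.
    ifstream fin("file.txt", ios::binary);
    fin.seekg(0, ios::end);
    streamsize fileSize = fin.tellg();
    fin.seekg(0, ios::beg);

    vector<char> buffer(static_cast<size_t>(fileSize) + 1);
    fin.read(buffer.data(), fileSize);
    buffer[static_cast<size_t>(fileSize)] = '\0';   // terminate for strtod

    // Walk a pointer through the buffer, parsing one double at a time.
    vector<double> coords;
    const char* p = buffer.data();
    char* end = nullptr;
    for (double v = strtod(p, &end); p != end; v = strtod(p, &end)) {
        coords.push_back(v);
        p = end;    // advance past the number just parsed
    }
    cout << coords.size()/2 << " coords read.\n";
}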

The Double Buffer Method

Allocate two buffers. Read one buffer. Start processing data in first buffer. Read into second buffer while processing first. When first buffer processing is finished, start processing second buffer and read into first. Repeat as necessary.

This method relies on overlapping the two tasks: reading while processing data. It can be implemented with two threads, or with non-blocking reads (the read operation doesn't wait for completion before returning). With threads, one thread processes while the other thread reads. Use semaphores to signal the end of processing and that a buffer is ready.
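
A minimal sketch of the two-thread variant, using std::async instead of semaphores (an illustrative choice, as is the 1MB chunk size). To keep the focus on the buffering pattern, the "processing" step just counts newlines; real coordinate parsing would also have to handle a number split across two chunks:

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <functional>
#include <future>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    ifstream fin("file.txt", ios::binary);
    const size_t kChunk = 1 << 20;                  // 1MB per buffer
    vector<char> bufA(kChunk), bufB(kChunk);

    // Read one chunk into buf; return the number of bytes actually read.
    auto readChunk = [&fin](vector<char>& buf) -> streamsize {
        fin.read(buf.data(), static_cast<streamsize>(buf.size()));
        return fin.gcount();
    };

    size_t lines = 0;
    streamsize gotA = readChunk(bufA);              // prime the first buffer
    while (gotA > 0) {
        // Start reading the next chunk into bufB on another thread...
        auto next = async(launch::async, readChunk, ref(bufB));
        // ...while this thread processes bufA (here: just count newlines).
        lines += static_cast<size_t>(count(bufA.data(), bufA.data() + gotA, '\n'));
        gotA = next.get();                          // wait for the read to finish
        swap(bufA, bufB);                           // bufA now holds the fresh data
    }
    cout << lines << " lines processed.\n";
}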

Multiple Buffer Method

Similar to the double buffer method, but with more buffers allocated. Consider this a race game: the reading task tries to stay ahead of the processing task. The processing task should wait until N buffers have been read before starting. The objective is to allocate enough buffers that the processing task never waits on the reader task.
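
A minimal sketch of the reader/processor race, using a mutex- and condition-variable-protected queue of buffers rather than semaphores (the chunk size and the newline-counting "processing" are illustrative; the N-buffer head start described above would be an extra condition on the same queue):

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <iostream>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>
using namespace std;

int main() {
    mutex m;
    condition_variable cv;
    deque<vector<char>> filled;   // buffers read from disk, waiting to be processed
    bool done = false;

    // Reader task: keep pushing filled buffers onto the queue.
    thread reader([&] {
        ifstream fin("file.txt", ios::binary);
        const size_t kChunk = 1 << 20;                       // 1MB per buffer
        for (;;) {
            vector<char> buf(kChunk);
            fin.read(buf.data(), static_cast<streamsize>(buf.size()));
            buf.resize(static_cast<size_t>(fin.gcount()));
            if (buf.empty()) break;                          // end of file
            {
                lock_guard<mutex> lock(m);
                filled.push_back(move(buf));
            }
            cv.notify_one();
        }
        lock_guard<mutex> lock(m);
        done = true;
        cv.notify_one();
    });

    // Processing task: take the next filled buffer and (here) count newlines.
    size_t lines = 0;
    for (;;) {
        vector<char> buf;
        {
            unique_lock<mutex> lock(m);
            cv.wait(lock, [&] { return !filled.empty() || done; });
            if (filled.empty()) break;                       // reader finished, queue drained
            buf = move(filled.front());
            filled.pop_front();
        }
        for (char c : buf)
            if (c == '\n') ++lines;
    }
    reader.join();
    cout << lines << " lines processed.\n";
}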

Optimization: Loop Unrolling

You may also want to consider loop unrolling as another speed optimization: duplicate the loop body several times so that fewer iterations (and loop-condition checks) are needed.
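
For example, a sketch of summing a vector of values, rolled and then unrolled by a factor of four (whether this actually helps depends on the compiler, which often unrolls such loops by itself; the stand-in data is illustrative):

#include <cstddef>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<double> coords(2000000, 1.0);                 // stand-in for the parsed data

    // Rolled: one element per iteration.
    double sum = 0.0;
    for (size_t j = 0; j < coords.size(); ++j)
        sum += coords[j];

    // Unrolled by 4: fewer loop-condition checks, more work per iteration.
    double sum4 = 0.0;
    size_t i = 0;
    size_t limit = coords.size() - coords.size() % 4;    // largest multiple of 4
    for (; i < limit; i += 4)
        sum4 += coords[i] + coords[i+1] + coords[i+2] + coords[i+3];
    for (; i < coords.size(); ++i)                       // leftover elements
        sum4 += coords[i];

    cout << sum << " " << sum4 << "\n";                  // both should print 2e+06
}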

I usually wait until the program works correctly before applying these techniques, unless it is incredibly slow -- such as a program that required 1 hour to process a 2GB file (it was optimized down to 2 minutes).

answered Nov 19 '22 by Thomas Matthews


Optimization should start from measurement. The described code does three things:

  1. reads some bytes from file
  2. parses them as text representation of doubles
  3. stores them in vector

Time share for each part should be measured.

For #1, you can pre-read the file contents into memory (50MB should fit without causing swapping) and then run the algorithm with the same time measurement. The difference will show how much of the time is spent reading from disk.
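
One way to sketch that measurement, assuming the same file.txt as in the question: slurp the whole file into a string outside the timed region, then time only the parsing loop and compare against the original end-to-end figure:

#include <ctime>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    // Pre-read the whole file into memory, outside the timed region.
    ifstream fin("file.txt");
    stringstream raw;
    raw << fin.rdbuf();
    string contents = raw.str();

    // Now time only parsing + vector growth, with no disk I/O involved.
    clock_t c1 = clock();
    istringstream in(contents);
    vector<double> coords;
    double coord;
    while (in >> coord)
        coords.push_back(coord);
    cout << "parse-only: " << (clock() - c1)/(double)CLOCKS_PER_SEC
         << " seconds for " << coords.size()/2 << " coords.\n";
}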

For #2, the file format can temporarily be switched to a sequence of binary representations of doubles. The difference will show how much of the time is spent parsing.
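
A sketch of that experiment (the file.bin name is just an illustration): dump the parsed values once as raw doubles, then read them back with a single block read and time that part as in the question:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    // One-time conversion: parse the text file and dump the raw doubles.
    vector<double> coords;
    {
        ifstream fin("file.txt");
        double v;
        while (fin >> v)
            coords.push_back(v);
        ofstream out("file.bin", ios::binary);
        out.write(reinterpret_cast<const char*>(coords.data()),
                  static_cast<streamsize>(coords.size() * sizeof(double)));
    }

    // Reading the binary form back needs no text parsing at all.
    ifstream fbin("file.bin", ios::binary);
    fbin.seekg(0, ios::end);
    size_t n = static_cast<size_t>(fbin.tellg()) / sizeof(double);
    fbin.seekg(0, ios::beg);

    vector<double> loaded(n);
    fbin.read(reinterpret_cast<char*>(loaded.data()),
              static_cast<streamsize>(n * sizeof(double)));
    cout << loaded.size()/2 << " coords read from binary.\n";
}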

For measuring #3, std::vector::reserve could be used (actually, as has already been suggested, this could serve not only as a measurement tool but also as an optimization).

Personally, I would not expect #3 to take significant time, but measurement is better than guessing.

After detecting the most time-consuming part, your efforts can become more focused and thus more effective...

answered Nov 19 '22 by maxim1000