
When to build your own buffer system for I/O (C++)?

I have to deal with very large text files (around 2 GB), and it is mandatory to read/write them line by line. Writing 23 million lines with ofstream is really slow, so at first I tried to speed things up by collecting large chunks of lines in a memory buffer (for example 256 MB or 512 MB) and then writing the buffer to the file in one go. This did not work; the performance is more or less the same. I have the same problem reading the files. I know the I/O operations are buffered by the STL I/O system, and that this also depends on the disk scheduler policy (managed by the OS, in my case Linux).
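For concreteness, the chunked-write idea I tried looks roughly like the sketch below (the buffer size, file name and function are simplified placeholders, not my exact code):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Sketch: accumulate many lines in one std::string and flush it with a
    // single large write() call, instead of streaming each line separately.
    void write_lines_chunked(const std::vector<std::string>& lines,
                             const char* path,
                             std::size_t chunk_bytes = 256 * 1024 * 1024)
    {
        std::ofstream out(path, std::ios::binary);
        std::string buffer;
        buffer.reserve(chunk_bytes);

        for (std::size_t i = 0; i < lines.size(); ++i) {
            buffer += lines[i];
            buffer += '\n';
            if (buffer.size() >= chunk_bytes) {
                out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
                buffer.clear();
            }
        }
        if (!buffer.empty())
            out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    }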

Any idea about how to improve the performance?

PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing the data, but I do not know (mainly in the case of the subprocess) whether this would be worthwhile.
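To make the PS concrete, this is roughly the threaded variant I have in mind (untested sketch, assumes C++11 threads; the class and all names are made up):

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <string>
    #include <thread>

    // Sketch: the main thread fills one chunk while a background thread
    // writes the previously submitted one (simple double buffering).
    class BackgroundWriter {
    public:
        explicit BackgroundWriter(const char* path)
            : out_(path, std::ios::binary), worker_(&BackgroundWriter::run, this) {}

        ~BackgroundWriter() {
            {
                std::lock_guard<std::mutex> lock(m_);
                done_ = true;
            }
            cv_.notify_one();
            worker_.join();
        }

        // Hand a filled chunk to the writer thread; blocks if it is still busy.
        void submit(std::string chunk) {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return pending_.empty(); });
            pending_ = std::move(chunk);
            cv_.notify_one();
        }

    private:
        void run() {
            std::unique_lock<std::mutex> lock(m_);
            for (;;) {
                cv_.wait(lock, [this] { return !pending_.empty() || done_; });
                if (pending_.empty() && done_) break;
                std::string chunk = std::move(pending_);
                pending_.clear();
                lock.unlock();
                out_.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
                lock.lock();
                cv_.notify_one();   // let the producer submit the next chunk
            }
        }

        std::ofstream out_;
        std::mutex m_;
        std::condition_variable cv_;
        std::string pending_;
        bool done_ = false;
        std::thread worker_;
    };

The idea would be to build each chunk in a std::string and call submit() on it; I suspect it only pays off if the processing between writes takes a comparable amount of CPU time to the I/O itself.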

asked Nov 06 '08 by Bocaballena


1 Answer

A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:

  • The HDD itself
  • The HDD interface (IDE/SATA/RAID/USB?)
  • Operating system/filesystem
  • C/C++ Library
  • Your code

I'd start by doing some measurements:

  • How long does your code take to read/write a 2GB file?
  • How fast can the 'dd' command read and write to disk? Example...

    dd if=/dev/zero bs=1024 count=2000000 of=file_2GB

  • How long does it take to write/read using just big fwrite()/fread() calls? (A rough timing sketch follows this list.)
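For the read direction of the dd test, `dd if=file_2GB of=/dev/null bs=1M` is the usual counterpart. And for the third measurement, here is a minimal, untested sketch of what I mean by "just big fread() calls" (the file name and the 64 MB chunk size are arbitrary placeholders):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Sketch: time how long raw fread() calls with a large chunk size take
    // to pull an existing file through the C stdio layer.
    int main() {
        const char* path = "file_2GB";              // placeholder path
        std::vector<char> chunk(64 * 1024 * 1024);  // 64 MB per read

        std::FILE* f = std::fopen(path, "rb");
        if (!f) return 1;

        auto start = std::chrono::steady_clock::now();
        std::size_t total = 0, n = 0;
        while ((n = std::fread(chunk.data(), 1, chunk.size(), f)) > 0)
            total += n;
        auto end = std::chrono::steady_clock::now();
        std::fclose(f);

        double seconds = std::chrono::duration<double>(end - start).count();
        std::printf("%zu bytes in %.1f s (%.1f MB/s)\n",
                    total, seconds, total / (1024.0 * 1024.0) / seconds);
    }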

Assuming your disk is capable of reading/writing at about 40 MB/s (which is probably a realistic figure to start from), your 2GB file can't be read or written in much less than about 50 seconds (2048 MB / 40 MB/s is roughly 51 s).

How long is it actually taking?

Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB) it takes about 43-48 seconds, and it is the same using fstream getline (line by line). cp takes almost 2 minutes to copy the file.

In which case, you're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad while it does so. So it will (as you see) be more than twice as slow as the simple 'read' case.

To improve the speed, the first thing I'd try is a faster hard drive, or an SSD.

You haven't said what the disk interface is. SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically in the same machine your code is running on, otherwise you're network-bound...

answered Sep 27 '22 by Roddy