
Optimal Buffer size for read-process-write

In my function, I need to read some data from a file into a buffer, manipulate the data and write it back to another file. The file is of unknown size and may be very large.

If I use a small buffer, there will be many read/write cycles and the operation will take a long time. A large buffer, on the other hand, means consuming more memory. What is the optimal buffer size I should use? Is this case dependent?

I have seen applications like 'Tera copy' on Windows that manage huge files efficiently. Are there any other techniques or mechanisms I should be aware of?

Note: This program will be running under Windows.

Dipto asked Mar 21 '13 06:03


4 Answers

See what Microsoft has to say about IO size: http://technet.microsoft.com/en-us/library/cc938632.aspx. Basically, they say you should probably do IO in 64K blocks.

On *NIX platforms, struct stat has an st_blksize member which gives the preferred block size for I/O on that file.
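On POSIX systems this could be queried with a small helper like the following sketch (preferred_block_size is just an illustrative name):

```c
#include <sys/stat.h>

/* Return the filesystem's preferred I/O block size for path,
 * or -1 if the path cannot be stat'ed. */
long preferred_block_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}
```

On most Linux filesystems this reports 4096, but network or exotic filesystems may prefer much larger blocks, which is exactly why it is worth asking rather than hard-coding a size.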

wilx answered Sep 23 '22 02:09


It is, indeed, highly case dependent, and you should probably just write your program to be able to handle a flexible buffer size, and then try out what size is optimal.

If you start small and then increase your buffer size, you will probably reach a certain size after which you'll see no or extremely small performance gains, since the CPU is spending most of its time running your code, and the overhead from the I/O has become negligible.
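Such an experiment could be sketched like this, with plain stdio and hypothetical file names; note that clock() is only a rough timer (on POSIX it measures CPU time), so a wall-clock API is better for serious I/O measurements:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Copy src to dst using a bufsize-byte buffer and return the elapsed
 * time in seconds, or -1.0 on error. */
double copy_with_buffer(const char *src, const char *dst, size_t bufsize)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    char *buf = malloc(bufsize);
    double elapsed = -1.0;
    size_t n;

    if (in && out && buf) {
        clock_t start = clock();
        while ((n = fread(buf, 1, bufsize, in)) > 0)
            fwrite(buf, 1, n, out);
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    free(buf);
    if (in) fclose(in);
    if (out) fclose(out);
    return elapsed;
}

/* Try a few sizes and print the timings. */
void benchmark_sizes(const char *src, const char *dst)
{
    size_t sizes[] = { 4096, 65536, 1 << 20 };  /* 4 KB, 64 KB, 1 MB */
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%8zu bytes: %.3f s\n",
               sizes[i], copy_with_buffer(src, dst, sizes[i]));
}
```

Run it against a file large enough to dominate the constant overheads; the curve usually flattens out somewhere between tens of kilobytes and a few megabytes.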

Dolda2000 answered Sep 26 '22 02:09


The first rule for these things is to benchmark. My guess would be that you are prematurely optimizing. If you are doing real file I/O, the bandwidth of your disk (or whatever) will usually be the bottleneck. As long as you write your data in chunks of several pages, the performance shouldn't change much.

What you could hope for is to do your computation on parts of the data in parallel with your write operation. For this you would have to keep two buffers: one that is currently being written, and one on which you do the processing. Then you would use asynchronous I/O functions (aio_write on POSIX systems; something similar probably exists for Windows, too) and switch buffers on each iteration.
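A rough sketch of that double-buffering scheme with POSIX AIO (link with -lrt on Linux) might look like this; process() stands in for whatever manipulation the question has in mind, here just uppercasing:

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE 65536

/* Placeholder for the real data manipulation. */
static void process(char *buf, ssize_t n)
{
    for (ssize_t i = 0; i < n; i++)
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] -= 32;
}

/* Read src, process each chunk, and write it to dst, overlapping the
 * read+process of one buffer with the asynchronous write of the other.
 * Returns 0 on success, -1 on error. */
int copy_process(const char *src, const char *dst)
{
    static char bufs[2][BUFSIZE];
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };
    int cur = 0, writing = 0;
    off_t offset = 0;
    ssize_t n;

    if (in < 0 || out < 0)
        return -1;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = out;

    while ((n = read(in, bufs[cur], BUFSIZE)) > 0) {
        process(bufs[cur], n);          /* overlaps the previous write */
        if (writing) {                  /* wait for the previous aio_write */
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            offset += aio_return(&cb);
        }
        cb.aio_buf = bufs[cur];
        cb.aio_nbytes = (size_t)n;
        cb.aio_offset = offset;
        if (aio_write(&cb) != 0)
            return -1;
        writing = 1;
        cur = 1 - cur;                  /* switch buffers */
    }
    if (writing) {                      /* drain the last write */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        aio_return(&cb);
    }
    close(in);
    close(out);
    return 0;
}
```

On Windows the analogous mechanism would be overlapped I/O (WriteFile with an OVERLAPPED structure), but the buffer-swapping pattern is the same.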

Jens Gustedt answered Sep 24 '22 02:09


Memory management is always case dependent and particularly when combined with file I/O.

I have two suggestions.

1) Use a fixed I/O buffer size, e.g. 64 KB, 256 KB, 512 KB or 1 MB. In this case, when the file is larger than the fixed buffer size, you have to track offsets and complete the I/O in multiple iterations.

2) Use a variable I/O buffer size allocated with malloc(). This depends on factors such as the RAM available in your system and the per-process dynamic memory allocation limit of your OS.
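Suggestion 1 could look roughly like this sketch, where chunked_copy is an illustrative name and the 64 KB size is just an example; the file position advances with each fread/fwrite, so arbitrarily large files are handled in multiple iterations:

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)   /* fixed 64 KB buffer; tune as needed */

/* Copy src to dst in CHUNK-sized pieces.
 * Returns 0 on success, -1 on error. */
int chunked_copy(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    char *buf = malloc(CHUNK);
    size_t n;
    int rc = -1;

    if (in && out && buf) {
        rc = 0;
        while ((n = fread(buf, 1, CHUNK, in)) > 0) {
            if (fwrite(buf, 1, n, out) != n) {
                rc = -1;            /* short write: disk full, etc. */
                break;
            }
        }
        if (ferror(in))
            rc = -1;
    }
    free(buf);
    if (in) fclose(in);
    if (out) fclose(out);
    return rc;
}
```

Because the buffer is malloc'ed rather than a fixed array, switching to suggestion 2 is just a matter of computing the size at runtime before the allocation.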

Kinjal Patel answered Sep 26 '22 02:09