
File I/O performance in C

Tags:

c

io


I have a question regarding file I/O in C and its performance.

I have an application that does a lot of file I/O (over its lifetime of ~3-6 hours it writes roughly 0.5-0.75TB, mostly file output). At the moment my application sprintf()s everything into a char string and, at the end of each line, write()s it to a file descriptor. My string is 1024 characters long, but lines can vary anywhere from 64 to 1024 characters.

The question is:
Would it make more sense to use a larger char buffer (say, 1MB?), sprintf() everything into it, and only then call write()? Or does it make more sense to skip sprintf() entirely and simply write() directly to the file, assuming write() takes care of buffering?

Something I thought of, but I'm unsure whether it would actually gain anything in terms of performance:
What if I had a structure where I store the individual parts (numbers and strings) and memcpy() the structure into the buffer instead? I'm guessing that would be similar to a binary write?

I'm trying to achieve a "buffered" approach, or anything else that will maximize performance. The catch is that I need to use the file for further processing afterwards. Any suggestions?

EDIT
I did a simple performance comparison of printf() plus shell redirection versus sprintf() followed by write(). In both cases I'm simply copying ~20GB to a file.

char string[1024];   /* filled elsewhere with a ~1KB line */

for (i = 0; i < (1 << 20) * 20; i++)   /* ~20M iterations x 1KB = ~20GB */
  printf("%s", string);

~/tmp/tests$ time ./printf.out > testing
real   2m22.101s
user   0m28.214s
sys    0m29.294s

as opposed to:

char string14[256]; /* ...likewise string24, string34, string44 */
for (i = 0; i < (1 << 20) * 20; i++) {
  sprintf(dst_string, "%s%s", dst_string, string14);
  sprintf(dst_string, "%s%s", dst_string, string24);
  sprintf(dst_string, "%s%s", dst_string, string34);
  sprintf(dst_string, "%s%s", dst_string, string44);
  write(fd, dst_string, 1024);
}

~/tmp/tests$ time ./write.out 

real   1m48.206s
user   0m58.544s
sys    0m41.079s

The reason for the multiple sprintf()s is to simulate copying into a buffer and then writing the buffer out. The difference in (real) time is not as insignificant as some comments suggest. Granted, this is a simple example, and in the overall scheme of computation plus I/O it may not matter.

The thing I'm a bit confused about in the printf example: where did the extra time go? user + sys don't add up to real; shouldn't they at least be in the same ballpark? There is a whole 1:30 missing.

Does this test support any conclusion? Is sprintf() + write() better than simply printf() + redirection?

Anyways, thank you all for the comments.

asked Jan 31 '13 by janjust

1 Answer

When I did some testing on my machine, I got about 60MB/s out of my not-so-modern hardware. That's 3.6GB per minute, or 216GB per hour (so 3 hours produces roughly 650GB). I would expect the time spent in your application to be mostly "waiting for the disk", in which case it makes absolutely no difference which I/O method you use.

But like ALL performance questions, it's not an answer you can find by asking on the internet or looking it up in a book. It has to be measured on the system(s) you are concerned about. Swap my skanky old hard disks for a nicely configured RAID and you get much better performance [if it's the right kind of RAID system - some are slower than individual disks, since the intention isn't to speed up access but to ensure reliability].

You can also make some comparisons:

1. Redirect your software's output to /dev/null and check how long your code takes to run now. If it's something like 10-100x faster than when writing to files, then you know that changing the way you write won't make any difference at all.
2. Create similar-sized files with dd if=/dev/zero of=yourfile bs=4k count=largenumber (largenumber * 4KB = typical file size; if your application writes several files, write a script that creates several files like that). If that's much faster than your application, then there's something to be gained by altering the way you do output from your application.

If either of the two tests above shows potential for gain, then write some benchmarks that produce a large amount of output the same way your application will, and see what makes a difference. By all means come back here and ask questions. But my guess is that your application won't run any faster or slower no matter what you do about the output mechanism, because it all comes down to how fast the disk can write.

answered Oct 24 '22 by Mats Petersson