What are the best practices for data-intensive reading and writing on a hard disk?

I'm developing a C++ application (running on a Linux box) that reads log files and writes derived results to disk very intensively. I'd like to know the best practices for optimizing this kind of application:

  • Which OS tweaks improve performance?
  • Which programming patterns boost I/O throughput?
  • Is pre-processing the data (converting to binary, compressing, etc.) a helpful measure?
  • Does chunking/buffering data help performance?
  • Which hardware capabilities should I be aware of?
  • Which practices are best for profiling and measuring performance in these applications?
  • (add here any concern I'm missing)

Is there a good read where I could get the basics of this so I could adapt the existing know-how to my problem?

Thanks

asked Jan 25 '11 by tonicebrian

4 Answers

Compression may certainly help a lot and is much simpler than tweaking the OS. Check out the gzip and bzip2 support in the Boost.IOStreams library. This takes its toll on the processor, though.
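For instance, here's a minimal sketch of reading a gzip-compressed log line by line with Boost.IOStreams (the file name is illustrative; link with -lboost_iostreams):

```cpp
#include <fstream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

int main() {
    // Open the compressed log in binary mode; "input.log.gz" is a placeholder.
    std::ifstream file("input.log.gz", std::ios_base::in | std::ios_base::binary);

    // Stack a gzip decompressor on top of the file stream.
    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());
    in.push(file);

    // Read decompressed lines as if it were a plain text file.
    std::string line;
    while (std::getline(in, line)) {
        // ... process each log line here ...
    }
}
```

Swapping gzip_decompressor for bzip2_decompressor (from boost/iostreams/filter/bzip2.hpp) gives you bzip2 instead.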

Measuring these kinds of jobs starts with the time command. If system time is very high compared to user time, then your program spends a lot of time doing system calls. If wall-clock ("real") time is high compared to system and user time, it's waiting for the disk or the network. The top command showing significantly less than 100% CPU usage for the program is also a sign of an I/O bottleneck.
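If you want the same user/system breakdown from inside the program rather than from the time command, getrusage() gives you a rough equivalent (a sketch; error handling omitted):

```cpp
#include <sys/resource.h>
#include <cstdio>

int main() {
    // ... run the I/O-heavy workload here ...

    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);

    // High system time relative to user time means many system calls.
    std::printf("user %ld.%06ld s, system %ld.%06ld s\n",
                (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
}
```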

answered by Fred Foo


1) Check out your disk's sector size.
2) Make sure the disk is defragmented.
3) Read data that is "local" to the last reads you have done, to improve cache locality (caching is performed by the operating system, and many hard drives also have a built-in cache).
4) Write data contiguously.

For write performance, cache blocks of data in memory until you reach a multiple of the sector size, then initiate an asynchronous write to disk. Do not overwrite the data currently being written until you can be certain the data has been written (i.e. sync the write). Double or triple buffering can help here.
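A minimal sketch of that write-side double buffering, using std::async for the background write (block size, file name, and the placeholder data are assumptions; error and partial-write handling omitted):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <future>
#include <vector>

// Assumed block size -- in practice, use a multiple of the disk's sector size.
constexpr size_t kBlock = 64 * 1024;

int main() {
    int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    std::vector<char> bufs[2] = {std::vector<char>(kBlock),
                                 std::vector<char>(kBlock)};
    std::future<ssize_t> inflight;  // the write currently on its way to disk
    int cur = 0;

    for (int i = 0; i < 100; ++i) {  // produce 100 blocks of placeholder data
        std::memset(bufs[cur].data(), 'A' + i % 26, kBlock);

        // Sync the previous write before issuing a new one, so we never
        // overwrite a buffer that is still being written.
        if (inflight.valid()) inflight.get();

        char* p = bufs[cur].data();
        inflight = std::async(std::launch::async,
                              [fd, p] { return write(fd, p, kBlock); });
        cur ^= 1;  // fill the other buffer while this one is written out
    }
    if (inflight.valid()) inflight.get();
    close(fd);
}
```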

For best read performance you can double-buffer reads. So let's say you cache 16K blocks on read. Read the 1st 16K from disk into block 1. Initiate an asynchronous read of the 2nd 16K into block 2. Start working on block 1. When you have finished with block 1, sync the read of block 2 and start an async read of the 3rd 16K into block 1. Now work on block 2. When finished, sync the read of the 3rd 16K block, initiate an async read of the 4th 16K into block 2, and work on block 1. Rinse and repeat until you have processed all the data.
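Here's a sketch of that read pattern using POSIX aio (file name and the trivial process() stand-in are assumptions; link with -lrt on older glibc; error handling omitted):

```cpp
#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

constexpr size_t kBlock = 16 * 1024;  // 16K blocks, as in the text

size_t total = 0;
void process(const char*, size_t n) { total += n; }  // stand-in for real work

int main() {
    int fd = open("input.log", O_RDONLY);  // illustrative file name
    if (fd < 0) return 1;

    char bufs[2][kBlock];
    aiocb cb{};
    off_t offset = 0;
    int cur = 0;

    // Read the 1st block synchronously into buffer 0.
    ssize_t n = pread(fd, bufs[0], kBlock, 0);

    while (n > 0) {
        // Initiate an async read of the next block into the other buffer.
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = bufs[cur ^ 1];
        cb.aio_nbytes = kBlock;
        cb.aio_offset = offset + n;
        aio_read(&cb);

        process(bufs[cur], n);  // work on the current block meanwhile

        // Sync the read: wait for the async read to complete.
        const aiocb* list[] = {&cb};
        aio_suspend(list, 1, nullptr);
        offset += n;
        n = aio_return(&cb);
        cur ^= 1;
    }
    close(fd);
}
```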

As already stated, the less data you have to read, the less time will be lost reading from disk, so it may well be worth reading compressed data and spending the CPU time expanding each block on read. Equally, compressing each block before writing will save you disk time. Whether or not this is a win will really depend on how CPU-intensive your processing of the data is.

Also, if the processing of the blocks is asymmetric (i.e. processing block 1 can take 3 times as long as processing block 2), then consider triple or more buffering for reads.

answered by Goz


Get information about the volume you'll be writing to/reading from, and create buffers that match the characteristics of the volume, e.g. 10 * clusterSize.
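For example (a sketch; make_buffer is a hypothetical helper, and the 10-block multiplier is just the suggestion above):

```cpp
#include <sys/statvfs.h>
#include <vector>

// Hypothetical helper: size an I/O buffer from the volume's block size.
std::vector<char> make_buffer(const char* path) {
    struct statvfs vfs{};
    if (statvfs(path, &vfs) != 0)
        return std::vector<char>(64 * 1024);  // fallback if the query fails
    return std::vector<char>(10 * vfs.f_bsize);  // e.g. 10 * block size
}
```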

Buffering helps a lot, as does minimizing the amount of writing you have to do.

answered by James


As was already stated here, you should check the block size, which you can do with the stat family of functions. In struct stat, this information is located in the st_blksize field.

The second thing is the posix_fadvise() function, which gives the OS advice about paging. You tell the system how you're going to use the file (or even a fragment of it). You'll find more on the manual page.
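A small sketch combining the two (file name is illustrative; error handling omitted):

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("input.log", O_RDONLY);  // illustrative file name
    if (fd < 0) return 1;

    // st_blksize is the preferred block size for efficient file system I/O.
    struct stat st{};
    fstat(fd, &st);
    std::printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);

    // Advise the kernel that we'll read sequentially, so it can read ahead
    // aggressively (offset 0, length 0 means the whole file).
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // ... read the file ...
    close(fd);
}
```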

answered by Maciej