For 'number-crunching' style applications that use a lot of data (read: hundreds of MB, but not into the GB range, i.e. it will fit comfortably into memory alongside the OS), does it make sense to read all your data into memory before starting processing, so the program doesn't become I/O bound while reading large related datasets and instead loads them from RAM?
Does this answer change between different data backings? That is, would the answer be the same whether you were using XML files, flat files, a full DBMS, etc.?
Programs that are I/O bound are often slower than CPU-bound programs, because the time spent waiting for data to be read or written can be substantial, far longer than the time the processor needs to complete its operations.

This is also why a scheduler distinguishes I/O-bound programs from CPU-bound ones: I/O-bound programs perform only a small amount of computation before performing I/O, so they typically do not use up their entire CPU quantum.

In short: CPU bound means the rate at which a process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound. I/O bound means the rate at which a process progresses is limited by the speed of the I/O subsystem.
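A rough illustration of the two in Python (the data file name is hypothetical, and the CPU-bound part assumes NumPy is available):

```python
import time

import numpy as np

def io_bound_task(path):
    # Dominated by disk reads: the CPU mostly sits idle waiting
    # for the OS to deliver the next chunk of the file.
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):  # 1 MB at a time
            total += len(chunk)
    return total

def cpu_bound_task(n=500, reps=50):
    # Dominated by arithmetic: repeated small matrix multiplies
    # keep the CPU busy with essentially no I/O.
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    for _ in range(reps):
        c = a @ b
    return c

start = time.perf_counter()
io_bound_task("big_dataset.bin")  # hypothetical data file
print(f"I/O bound: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
cpu_bound_task()
print(f"CPU bound: {time.perf_counter() - start:.2f}s")
```

On most machines the first timing will be dominated by the disk and the second by the processor, which is exactly the distinction the terms are making.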
Your program is only as fast as its bottleneck. It makes sense to do things like storing your data in memory if that improves the overall performance, but there is no hard and fast rule that says it will. When you fix one bottleneck, something else becomes the bottleneck, so resolving one issue may gain you 1% or 1000% depending on what the next bottleneck is; and the thing you're improving may still be the bottleneck afterwards.
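One way to find out where the bottleneck actually is: time each phase separately before deciding what to optimize. A minimal sketch, where the file name and the "processing" step are placeholders:

```python
import time

def timed(label, fn):
    # Wall-clock timing is crude, but good enough to see
    # which phase of the program dominates.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# "dataset.csv" and the processing step are placeholders.
data = timed("load", lambda: open("dataset.csv").read())
rows = timed("parse", lambda: data.splitlines())
out = timed("process", lambda: [len(r) for r in rows])
```

If "load" dwarfs everything else, caching the data in memory is worth considering; if "process" dominates, it won't buy you much.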
I think about these things as generally fitting into one of three levels: lazy (read data only at the moment it's needed), eager (read all the data up front, before processing starts), and over-eager (read and pre-process everything, whether or not it will ever be used).
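A minimal sketch of the difference between the first two levels, assuming the records live in a line-oriented text file:

```python
def lazy_records(path):
    # Lazy: yield one record at a time. Memory use stays constant,
    # but every pass over the data hits the disk again.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def eager_records(path):
    # Eager: read everything once up front. Later passes over the
    # data are pure RAM access, at the cost of holding it all.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]
```

If you only make one pass over the data, lazy is usually fine; if you make many passes, the eager version pays for itself quickly.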
The lesson to take from this is the (somewhat over-used and often mis-quoted) observation from Donald Knuth that "premature optimization is the root of all evil." Eager and over-eager solutions add a huge amount of complexity, so there is no point building them for something that won't yield a useful benefit.
Programmers often make the mistake of building a highly optimized (or allegedly optimized) version of something before determining whether they need to and whether it will actually be useful.
My own take on this is: don't solve a problem until you have a problem.
I would guess that choosing the right data storage method will have more effect than whether you read from disk all at once or as needed.
Most database tables have regular offsets for the fields in each row. For example, a customer record may be 50 bytes long with a pants_size column starting at the 12th byte. Selecting all pants sizes is then as easy as reading the values at offsets 12, 62, 112, 162, ad nauseam.
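In code, that offset arithmetic could look something like the following sketch, using the hypothetical 50-byte layout from the example:

```python
RECORD_SIZE = 50   # bytes per customer record (hypothetical layout)
FIELD_OFFSET = 12  # pants_size starts at byte 12 of each record
FIELD_SIZE = 1     # assume a one-byte unsigned value

def all_pants_sizes(path):
    # Seek directly to offsets 12, 62, 112, ... without reading
    # the rest of each record.
    sizes = []
    with open(path, "rb") as f:
        pos = FIELD_OFFSET
        while True:
            f.seek(pos)
            raw = f.read(FIELD_SIZE)
            if len(raw) < FIELD_SIZE:
                break
            sizes.append(raw[0])
            pos += RECORD_SIZE
    return sizes
```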
XML, however, is a lousy format for fast data access: you'll have to slog through a bunch of variable-length tags and attributes to get at your data, and you can't jump straight from one record to the next, unless you parse the file up front into a data structure like the one described above. In which case you'd have something very much like an RDBMS, so there you go.
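For instance, using Python's standard xml.etree.ElementTree, you can pay the parsing cost once and keep a plain in-memory structure afterwards (the file name, element, and attribute names here are invented for the sketch):

```python
import xml.etree.ElementTree as ET

# One slow pass over the variable-length XML...
tree = ET.parse("customers.xml")  # hypothetical file and schema
records = [
    {"id": c.get("id"), "pants_size": int(c.findtext("pants_size"))}
    for c in tree.getroot().iter("customer")
]

# ...then every later lookup is a cheap in-memory operation.
sizes = [r["pants_size"] for r in records]
```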