What's the best way to unit test large data sets? Some legacy code that I'm maintaining has structures of a hundred members or more; other parts of the code that we're working on create or analyze data sets of hundreds of samples.
The best approach I've found so far is to deserialize the structures or data sets from disk, perform the operations under test, serialize the results to disk, then diff the files containing the serialized results against files containing expected results. This isn't terribly fast, and it violates the "don't touch the disk" principle of unit testing. However, the only alternative I can think of (writing code to initialize and test hundreds of members and data points) seems unbearably tedious.
Are there any better solutions?
If what you are trying to achieve is, in fact, a unit test, you should mock out the underlying data structures and simulate the data. This technique gives you complete control over the inputs. For example, each test you write can handle a single data point, so you end up with a very concise set of tests, one for each condition. There are several open source mocking frameworks out there; I personally recommend Rhino Mocks (http://ayende.com/projects/rhino-mocks/downloads.aspx) or NMock (http://www.nmock.org).
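To make the "one data point per test" idea concrete, here is a minimal sketch in Python using the standard library's `unittest.mock` rather than the .NET frameworks named above. The function `needs_service`, the helper `make_record`, and the field names `hours_run` and `error_count` are all hypothetical, invented for illustration; the assumption is that the code under test reads only a couple of members of a much larger structure.

```python
from unittest.mock import Mock


def needs_service(record):
    """Hypothetical code under test: reads only two fields of a large record."""
    return record.hours_run > 1000 or record.error_count > 0


def make_record(**overrides):
    """Build a stand-in for the large structure with neutral defaults.

    Only the fields the code under test reads are given values; each test
    overrides the single field it cares about.
    """
    record = Mock(name="LegacyRecord")  # hypothetical structure name
    record.hours_run = 0
    record.error_count = 0
    for field, value in overrides.items():
        setattr(record, field, value)
    return record


def test_new_record_needs_no_service():
    assert needs_service(make_record()) is False


def test_high_hours_triggers_service():
    assert needs_service(make_record(hours_run=1001)) is True


def test_any_error_triggers_service():
    assert needs_service(make_record(error_count=1)) is True
```

Each test names one condition and sets one field, so nobody has to initialize a hundred members by hand, and nothing touches the disk.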
If it is not possible for you to mock out the data structures, I recommend refactoring so that you can :-) It's worth it! Alternatively, you may want to try TypeMock (http://www.typemock.com/), which allows mocking of concrete classes.
If, however, you're running tests against large data sets, you're really running functional tests, not unit tests. In that case, loading data into a database or from disk is a typical operation. Rather than avoiding it, you should work on getting these tests running in parallel with the rest of your automated build process so that the performance impact isn't holding up any of your developers.
The approach you describe is still viable, although I would classify it as a functional test rather than a pure unit test. A good unit test would take a sampling of those records that gives a good distribution of the edge cases you may encounter, and write those up as individual tests. Then you have a final "acceptance" or "functional" test that runs the bulk test on all the data.
I have used this approach when testing large amounts of data, and I find it works well enough: the small unit tests are maintainable, the bulk test tells me the whole thing works, and it's all automatic.
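A rough sketch of that split, in Python, might look like the following. The function `normalize` is a hypothetical stand-in for the code under test, and the bulk data is generated in memory from a fixed seed purely so the example is self-contained; in practice the bulk test would load your real data set and compare against stored expected results.

```python
import random


def normalize(samples):
    """Hypothetical function under test: rescale samples to the range [0, 1]."""
    if not samples:
        return []
    lo, hi = min(samples), max(samples)
    if lo == hi:
        # All samples are equal, so there is no range to scale over.
        return [0.0 for _ in samples]
    return [(s - lo) / (hi - lo) for s in samples]


# Small, hand-picked sample covering the edge cases: (input, expected output).
EDGE_CASES = [
    ([], []),                             # empty data set
    ([5.0], [0.0]),                       # single sample
    ([2.0, 2.0, 2.0], [0.0, 0.0, 0.0]),   # no variation
    ([-1.0, 0.0, 1.0], [0.0, 0.5, 1.0]),  # negative values
]


def test_edge_cases():
    """Fast unit test: one small input per edge case."""
    for samples, expected in EDGE_CASES:
        assert normalize(samples) == expected, samples


def test_bulk():
    """Bulk check over hundreds of samples, verifying overall properties."""
    rng = random.Random(42)  # fixed seed keeps the run repeatable
    samples = [rng.uniform(-100.0, 100.0) for _ in range(500)]
    result = normalize(samples)
    assert len(result) == len(samples)
    assert min(result) == 0.0 and max(result) == 1.0
```

The edge-case table stays small enough to read and maintain by hand, while the bulk test can run later in the build without slowing down the quick unit tests.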