I have a large number (>100k) of relatively small files (1 KB - 300 KB) that I need to read in and process. I'm currently looping through all the files, using File.ReadAllText to read the content of each one, processing it, and then moving on to the next file. This is quite slow, and I was wondering if there is a good way to optimize it.
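Simplified, my loop looks something like this (ProcessContent stands in for my actual processing, and the folder path is just an example):

```csharp
using System.IO;

class Reader
{
    static void ProcessContent(string content)
    {
        // ... my actual per-file processing ...
    }

    static void Main()
    {
        string folder = @"C:\data"; // example path
        foreach (string file in Directory.GetFiles(folder))
        {
            string content = File.ReadAllText(file);
            ProcessContent(content); // the next read can't start until this returns
        }
    }
}
```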
I have already tried using multiple threads, but since this seems to be I/O-bound, I didn't see any improvement.
You're most likely correct: reading that many files means disk I/O will be the limiting factor, which caps your potential speedup.
That being said, you can very likely get a small improvement by moving the processing of the data onto a separate thread.
I would recommend having a single "producer" thread that reads your files; this thread will be I/O-limited. As it reads each file, it can push the processing onto a ThreadPool thread (.NET 4 Tasks work great for this too), which lets it immediately move on to reading the next file.
This at least takes the processing time out of the total runtime, making the total time for the job nearly as fast as the disk I/O alone, provided you've got an extra core or two to work with. A minimal sketch of the idea follows.
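Something along these lines, using .NET 4 Tasks (ProcessContent is a placeholder for your actual per-file processing, and the folder path is just an example):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class FileProcessor
{
    // Placeholder: substitute your actual per-file processing here.
    static void ProcessContent(string path, string content)
    {
        // ... CPU-bound work on 'content' ...
    }

    static void Main()
    {
        string folder = @"C:\data"; // example path
        var tasks = new List<Task>();

        // Single producer: reads files one after another (I/O-bound),
        // handing each file's content to the ThreadPool for processing
        // so the next read can start immediately.
        foreach (string path in Directory.EnumerateFiles(folder))
        {
            string content = File.ReadAllText(path);
            string captured = path; // avoid closing over the loop variable (C# 4 semantics)
            tasks.Add(Task.Factory.StartNew(() => ProcessContent(captured, content)));
        }

        // Wait for any processing still in flight to complete.
        Task.WaitAll(tasks.ToArray());
    }
}
```

Note that with 100k+ files this queues a lot of small tasks; if memory becomes a concern, you could process in batches or use a bounded producer/consumer queue (e.g. BlockingCollection<T>) instead.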