I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader
and StreamWriter
.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[]
using StreamWriter
. After exploring StackOverflow and the rest of the internet, I came across File.WriteAllLines
, which allows me to write string[]
s to file, but I dont think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder
and using the AppendLine
method to add each line to make a gigantic string. Then I could convert that StringBuilder
to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?
Two things will increase the speed of output using StreamWriter
.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader
reads into a buffer and parses lines out of it, and the StreamWriter
writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader
and StreamWriter
is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
const int BufferSize = 64 * 1024;
var reader = new StreamReader(filename, Encoding.UTF8, true, BufferSize);
var writer = new StreamWriter(filename, Encoding.UTF8, BufferSize);
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder
is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems, you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance--to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than they do with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time)
to something very close to max(input-time, process-time, output-time)
. And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.
According to the docs, StreamWriter
does not automatically flush after every write by default, so it is buffered.
You could also use some of the lazy methods on the File
class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine
is declared like so:
private string ProcessLine(string input) {
string result = // do some calculation on input
return result;
}
Since ReadLines
is lazy and WriteAllLines
has a lazy overload, it will stream the file rather than attempting to read the whole thing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With