Task: Process three text files, each close to 1 GB in size, and turn them into CSV files. The source files have a custom structure, so regular expressions would be useful.
Problem: There is no problem. I use PHP for it and it's fine. I don't actually need to process the files faster; I'm just curious how you would approach the problem in general. In the end I'd like to see simple and convenient solutions that might perform faster than PHP.
@felix I'm sure about that. :) If I'm done with the whole project I'll probably post this as a cross-language code ping pong.
@mark My approach currently works like that, with the exception that I cache a few hundred lines to keep file writes low. A well-thought-out memory trade-off would probably squeeze out some more time. But I'm sure that other approaches can beat PHP by far, for example a full utilization of the *nix toolset.
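For reference, a rough sketch of that write-buffering idea in PHP might look like the following; `convert_line()` is a hypothetical stand-in for the real regex-based conversion, and the buffer size and file names are arbitrary placeholders:

```php
<?php
// Rough sketch: buffer a few hundred converted lines before writing.
// convert_line() is a hypothetical stand-in for the real regex conversion.
function convert_line(string $line): string {
    return str_replace("\t", ',', rtrim($line, "\n")) . "\n";
}

$bufferLimit = 500;               // arbitrary buffer size
$buffer = [];
$in  = fopen('source.txt', 'r');
$out = fopen('result.csv', 'w');

while (($line = fgets($in)) !== false) {
    $buffer[] = convert_line($line);
    if (count($buffer) >= $bufferLimit) {
        fwrite($out, implode('', $buffer));   // one write per few hundred lines
        $buffer = [];
    }
}
if ($buffer) {
    fwrite($out, implode('', $buffer));       // flush the remainder
}
fclose($in);
fclose($out);
```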
Firstly, it probably doesn't matter much which language you use for this, as the task will most likely be I/O-bound. What is more important is that you use an efficient approach/algorithm. In particular, you want to avoid reading the entire file into memory if possible, and avoid concatenating the result into a huge string before writing it to disk.
Instead, use a streaming approach: read a line of input, process it, then write a line of output.
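For illustration, a minimal sketch of that streaming approach in PHP could look like this; the file names and the regular expression are placeholders, since the actual source structure isn't shown here:

```php
<?php
// Minimal streaming sketch: one line in, one CSV row out.
// The pattern and file names are placeholders for the real source structure.
$pattern = '/^(\w+)\s+(\d+)\s+(.*)$/';

$in  = fopen('source.txt', 'r');
$out = fopen('result.csv', 'w');

while (($line = fgets($in)) !== false) {
    if (preg_match($pattern, $line, $m)) {
        // fputcsv takes care of quoting and escaping the CSV fields.
        fputcsv($out, [$m[1], $m[2], $m[3]]);
    }
}

fclose($in);
fclose($out);
```

Because fgets reads one line at a time, memory use stays constant regardless of file size, and the work is dominated by disk I/O rather than by the language doing the processing.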