 

How would you process 1GB of text data?

Tags: regex, text, php

Task: Process three text files of close to 1GB each and turn them into CSV files. The source files have a custom structure, so regular expressions are useful for parsing them.

Problem: There is no problem. I use PHP for it and it's fine. I don't actually need to process the files faster; I'm just curious how you would approach the problem in general. In the end I'd like to see simple and convenient solutions that might perform faster than PHP.

@felix I'm sure about that. :) If I'm done with the whole project I'll probably post this as cross-language code ping-pong.

@mark My approach currently works like that, with the exception that I cache a few hundred lines to keep file writes low. A well-thought-out memory trade-off would probably squeeze out some time. But I'm sure that other approaches can beat PHP by far, such as full utilization of a *nix toolset.
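For reference, the write-buffering described in this comment might look roughly like the sketch below. The file names, the 500-line flush threshold, and the `convert_to_csv_row()` helper are placeholders, since the actual record format isn't given.

```php
<?php
// Rough sketch: collect a few hundred converted lines in memory,
// then flush them to disk in one write to keep write calls low.
$in  = fopen('source.txt', 'r');   // hypothetical input file
$out = fopen('result.csv', 'w');   // hypothetical output file
$buf = [];

while (($line = fgets($in)) !== false) {
    $buf[] = convert_to_csv_row($line);  // placeholder conversion function
    if (count($buf) >= 500) {            // flush every ~500 lines
        fwrite($out, implode("\n", $buf) . "\n");
        $buf = [];
    }
}
if ($buf) {
    fwrite($out, implode("\n", $buf) . "\n");  // flush the remainder
}

fclose($in);
fclose($out);
```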

asked Feb 27 '23 by c0rnh0li0

1 Answer

Firstly, it probably doesn't matter much which language you use for this, as the work will most likely be I/O bound. What is more important is that you use an efficient approach/algorithm. In particular, you want to avoid reading the entire file into memory if possible, and avoid concatenating the result into a huge string before writing it to disk.

Instead, use a streaming approach: read a line of input, process it, then write a line of output.
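As an illustration, here is a minimal PHP sketch of that streaming approach. The question doesn't specify the custom structure, so the pipe-delimited record format and the regex are made up; the point is only that each line is read, converted, and written without ever holding the whole file in memory.

```php
<?php
// Stream the source file line by line and write CSV rows as we go,
// so memory use stays roughly constant regardless of file size.
$in  = fopen('source.txt', 'r');   // hypothetical input file
$out = fopen('result.csv', 'w');

while (($line = fgets($in)) !== false) {
    // Hypothetical record format: "name|12345|free text"
    if (preg_match('/^(\w+)\|(\d+)\|(.*)$/', rtrim($line, "\r\n"), $m)) {
        // fputcsv handles quoting and escaping of the fields
        fputcsv($out, [$m[1], $m[2], $m[3]]);
    }
}

fclose($in);
fclose($out);
```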

answered Mar 06 '23 by Mark Byers