 

sed optimization (large file modification based on smaller dataset)

I have to deal with very large plain text files (over 10 gigabytes; yes, I know it depends on what we call large), with very long lines.

My most recent task involves some line editing based on data from another file.

The data file (which should be modified) contains 1,500,000 lines, each around 800 characters long. Each line is unique and contains exactly one identity number, and each identity number is unique.

The modifier file is around 1,800 lines long; each line contains an identity number plus an amount and a date that should be modified in the data file.

I just transformed the modifier file into a sed script (with Vim regexes), but it's very inefficient.

Let's say I have a line like this in the data file:

(some 500 characters)id_number(some 300 characters)

And I need to modify data in the 300-character part.

Based on the modifier file, I come up with sed lines like this:

/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/

So I have 1800 lines like this.

But I know that even on a very fast server, if I run

sed -i.bak -f modifier.sed data.file

it's very slow, because sed has to test every pattern against every line (1,800 patterns × 1,500,000 lines).

Isn't there a better way?

Note: I'm not a programmer and have never studied algorithms (in school). I can use awk, sed, and an outdated version of Perl on the server.

asked May 11 '09 by Zsolt Botykai


1 Answer

My suggested approaches (in order of desirability) would be to process this data as:

  1. A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
  2. A flat file containing fixed record lengths
  3. A flat file containing variable record lengths

Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back). Take a look at DBD::SQLite in the case of Perl.
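
As a rough sketch of that approach (all file, table, and column names, the tab-separated modifier format, and the id-matching regex /(\d{10})/ are assumptions; adjust them to your actual data):

    #!/usr/bin/perl
    # Sketch only: load the big file into SQLite once, then apply the
    # ~1800 modifications as indexed UPDATEs instead of 1800 full scans.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=data.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do("CREATE TABLE records (id TEXT PRIMARY KEY, line TEXT)");

    # One-time load; assumes the identity number can be captured from
    # each line with a regex (here a made-up /(\d{10})/).
    open my $data, '<', 'data.file' or die $!;
    my $ins = $dbh->prepare("INSERT INTO records VALUES (?, ?)");
    while (my $line = <$data>) {
        chomp $line;
        my ($id) = $line =~ /(\d{10})/ or next;
        $ins->execute($id, $line);
    }
    close $data;
    $dbh->commit;

    # Each UPDATE finds its row through the primary-key index, so this
    # touches 1800 rows instead of rescanning 1.5M lines 1800 times.
    open my $mods, '<', 'modifier.file' or die $!;
    my $upd = $dbh->prepare("UPDATE records
        SET line = substr(line, 1, 650) || ? || substr(line, 671)
        WHERE id = ?");
    while (my $mod = <$mods>) {
        chomp $mod;
        my ($id, $new_field) = split /\t/, $mod;  # assumed tab-separated
        $upd->execute($new_field, $id);
    }
    close $mods;
    $dbh->commit;
    $dbh->disconnect;

The UPDATE mirrors your sed substitution: it keeps the first 650 characters, swaps in the new amount-and-date field, and keeps the rest of the line.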

If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
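
If the IDs aren't usable as record numbers directly, one sequential pass can build that index. A sketch (file names and the id regex are again assumptions) that records each id together with the byte offset where its line starts:

    #!/usr/bin/perl
    # Sketch only: one pass over the data file, writing
    # "id <TAB> byte-offset" pairs to a small index file.
    use strict;
    use warnings;

    open my $data, '<', 'data.file'  or die $!;
    open my $idx,  '>', 'data.index' or die $!;

    my $offset = tell $data;                  # 0: start of first line
    while (defined(my $line = <$data>)) {
        if (my ($id) = $line =~ /(\d{10})/) { # assumed id pattern
            print $idx "$id\t$offset\n";
        }
        $offset = tell $data;                 # start of the next line
    }
    close $idx;
    close $data;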

If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, perhaps the existing data will never move around in the file? Then you can maintain the previously mentioned index and add new entries as necessary, the difference being that instead of pointing to a record number, the index now points to the absolute byte position in the file.
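
Putting those two pieces together, a sketch of the in-place update. This is only safe because each replacement is exactly as long as the text it overwrites, so nothing in the file moves; the 650/20 positions and the tab-separated modifier format are assumptions carried over from above:

    #!/usr/bin/perl
    # Sketch only: seek straight to each affected record via the index
    # and overwrite the 20-byte amount-and-date field in place.
    use strict;
    use warnings;

    my $FIELD_POS = 650;   # field starts after the first 650 bytes
    my $FIELD_LEN = 20;

    # Load the index built in the previous sketch: id => byte offset.
    my %offset_for;
    open my $idx, '<', 'data.index' or die $!;
    while (my $entry = <$idx>) {
        chomp $entry;
        my ($id, $offset) = split /\t/, $entry;
        $offset_for{$id} = $offset;
    }
    close $idx;

    # Open read-write: the file is patched in place, never copied.
    open my $data, '+<', 'data.file' or die $!;
    open my $mods, '<',  'modifier.file' or die $!;
    while (my $mod = <$mods>) {
        chomp $mod;
        my ($id, $new_field) = split /\t/, $mod;  # assumed format
        next unless exists $offset_for{$id};
        die "replacement must be exactly $FIELD_LEN bytes\n"
            unless length($new_field) == $FIELD_LEN;
        seek $data, $offset_for{$id} + $FIELD_POS, 0 or die $!;
        print {$data} $new_field;
    }
    close $mods;
    close $data;

That's roughly 1,800 seeks and 1,800 small writes instead of rewriting 10 GB.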

answered Sep 19 '22 by MikeyB