 

Grepping progressively through large file

Tags: grep, shell, unix

I have several large data files (~100MB-1GB of text) and a sorted list of tens of thousands of timestamps that index data points of interest. The timestamp file looks like:

12345
15467
67256
182387
199364
...

And the data file looks like:

Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431

The data in the second file is all in order of timestamp. I want to grep through the second file using the timestamps from the first, printing the timestamp and the fourth data item to an output file. I've been using this:

grep -wf time.stamps data.file | awk '{print $1 "\t" $4 }'  >> output.file

This is taking on the order of a day to complete for each data file. The problem is that this command searches through the entire data file for every line in time.stamps, but I only need the search to pick up from the last data point. Is there any way to speed up this process?

asked Jul 03 '13 by user2548142

1 Answer

You can do this entirely in awk:

awk 'NR==FNR{a[$1]++;next}($1 in a){print $1,$4}' timestampfile datafile
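
Here, NR==FNR is only true while awk reads the first file (the timestamp list), so those values are stored as keys of the array a; for the second file, any line whose first field is one of the stored timestamps has its first and fourth fields printed. Each file is read exactly once, instead of re-scanning the data file for every timestamp. A minimal sketch applying this to the filenames from the question (time.stamps, data.file, output.file), with a tab separator to match the original grep/awk pipeline:

# time.stamps : sorted list of timestamps, one per line
# data.file   : data file whose data lines start with a timestamp
awk 'NR==FNR {a[$1]; next} ($1 in a) {print $1 "\t" $4}' time.stamps data.file > output.file

With tens of thousands of timestamps the in-memory array stays small, so a single pass over each ~100MB-1GB data file should be far faster than repeatedly scanning it with grep.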
answered Sep 25 '22 by jaypal singh