I have a file1 that has a few lines (tens of them) and a much longer file2 (~500,000 lines). The lines in each file are not identical, although there is a subset of fields that are identical. I want to take fields 3-5 from each line in file1, and search file2 for the same pattern (just those three fields, in the same order -- in file2, they fall in fields 2-4). If any match is found, then I want to delete the corresponding line from file1.
E.g., file1:
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current
file2:
2016-01-06T07:35:06.87 2016003 100E C NN Current 0
2016-01-06T07:35:09.97 2016003 100E B TM Current 6303
2016-01-06T07:36:23.12 2016004 030N C TM Current 0
2016-01-06T07:37:57.36 2016006 090E A TM Current 399
2016-01-06T07:40:29.61 2016006 010N C TM Current 0
... (and on for 500,000 lines)
So in this case, I want to delete the fourth line of file1 (in place).
The following finds the lines I want to delete:
grep "$(awk '{print $3,$4,$5}' file1)" file2
So one solution may be to pipe this to sed, but I'm unclear how to set a match pattern in sed from piped input. Searching online suggests awk can probably do all of this (or perhaps sed, or something else), so I'm wondering what a clean solution would look like.
Also, speed is somewhat important because other processes may attempt to modify the files while this is going on (I know this may present more complications...). Matches will generally be found at the end of file2, not the beginning (in case there is some way to search file2 from the bottom up).
$ awk 'NR==FNR{file2[$2,$3,$4]; next} !(($3,$4,$5) in file2)' file2 file1
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current
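For reference, here is the same one-liner expanded with comments. The logic is identical; the array is renamed `seen` for readability, and the two-line sample files are a hypothetical cut-down of the data in the question:

```shell
# Hypothetical miniature versions of the question's files.
printf '%s\n' \
  '2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current' \
  '2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current' > file1
printf '%s\n' \
  '2016-01-06T07:37:57.36 2016006 090E A TM Current 399' \
  '2016-01-06T07:40:29.61 2016006 010N C TM Current 0'   > file2

# First pass: NR==FNR is true only while reading the first file argument
# (file2). Store fields 2-4 as an array key; only membership matters, so
# no value is assigned. "next" skips the second rule for these lines.
# Second pass (file1): print a line only if its fields 3-5 were NOT seen
# in file2 -- a pattern with no action defaults to { print }.
awk 'NR == FNR { seen[$2,$3,$4]; next }
     !(($3,$4,$5) in seen)' file2 file1
```

Note that `seen[$2,$3,$4]` joins the three fields with awk's `SUBSEP`, so the lookup `($3,$4,$5) in seen` only matches when all three fields agree in order.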
The fact that file2 contains 500,000 lines should be no problem for awk with respect to either memory or execution speed -- it should complete in about a second or less even in the worst case.
With any UNIX command, to overwrite the original file you just do:
cmd file > tmp && mv tmp file
so in this case:
awk '...' file2 file1 > tmp && mv tmp file1