
awk/sed/grep to delete lines matching fields in another file

Tags: linux, bash, sed, awk

I have file1, which has a few tens of lines, and a much longer file2 (~500,000 lines). The lines in the two files are not identical, but a subset of their fields is. I want to take fields 3-5 from each line in file1 and search file2 for the same three values in the same order (in file2 they are fields 2-4). If a match is found, I want to delete the corresponding line from file1.

Eg, file1:

2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

file2:

2016-01-06T07:35:06.87 2016003 100E C NN Current 0
2016-01-06T07:35:09.97 2016003 100E B TM Current 6303
2016-01-06T07:36:23.12 2016004 030N C TM Current 0
2016-01-06T07:37:57.36 2016006 090E A TM Current 399
2016-01-06T07:40:29.61 2016006 010N C TM Current 0

... (and on for 500,000 lines)

So in this case, I want to delete the fourth line of file1 (in place).

The following finds the lines in file2 that match, which identifies what I want to delete from file1:

grep "$(awk '{print $3,$4,$5}' file1)" file2

One solution may be to pipe this to sed, but I'm unclear how to set a match pattern in sed from piped input. Searching online suggests awk can probably do all of this (or perhaps sed, or something else), so I'm wondering what a clean solution would look like.

Also, speed is somewhat important because other processes may attempt to modify the files while this is going on (I know this may present more complications...). Matches will generally be found at the end of file2, not the beginning (in case there is some way to search file2 from the bottom up).

trid3 asked Jan 06 '16 16:01


1 Answer

$ awk 'NR==FNR{file2[$2,$3,$4]; next} !(($3,$4,$5) in file2)' file2 file1
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

The fact that file2 contains 500,000 lines should be no problem for awk with respect to memory or execution speed: it reads file2 once, and it should complete in about 1 second or less even in the worst case.
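For readers who want to see how the one-liner above works, here is a commented sketch that rebuilds the question's sample data in a scratch directory and runs the same logic. The array name `seen`, the output file `kept`, and the use of `mktemp -d` are illustrative choices, not part of the original answer.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched (assumes mktemp is available).
dir=$(mktemp -d) && cd "$dir" || exit 1

# Sample data copied from the question.
cat > file1 <<'EOF'
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current
EOF

cat > file2 <<'EOF'
2016-01-06T07:35:06.87 2016003 100E C NN Current 0
2016-01-06T07:35:09.97 2016003 100E B TM Current 6303
2016-01-06T07:36:23.12 2016004 030N C TM Current 0
2016-01-06T07:37:57.36 2016006 090E A TM Current 399
2016-01-06T07:40:29.61 2016006 010N C TM Current 0
EOF

awk '
    # NR==FNR is true only while reading the first file named (file2).
    # Store fields 2-4 as a composite array key; merely referencing the key creates it.
    NR == FNR { seen[$2, $3, $4]; next }

    # For file1: print the line only when fields 3-5 were not seen in file2.
    !(($3, $4, $5) in seen)
' file2 file1 > kept

cat kept
```

Because the keys are held in a hash, each line of file1 is checked with a single lookup, so the order of matches within file2 (start or end) makes no difference. One caveat worth knowing: if file2 is empty, `NR==FNR` also holds while reading file1, so nothing is printed.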

With any UNIX command, to overwrite the original file you just do:

cmd file > tmp && mv tmp file

so in this case:

awk '...' file2 file1 > tmp && mv tmp file1
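A slightly more defensive variant of the same overwrite idiom, as a sketch only: it writes to a uniquely named temporary file next to file1 and replaces file1 only when awk exits successfully. The `mktemp` template, the array name `seen`, and the two-line sample data are assumptions made for illustration.

```shell
#!/bin/sh
# Scratch directory with a small sample (two lines taken from the question's file1).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' \
    '2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current' \
    '2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current' > file1
printf '%s\n' \
    '2016-01-06T07:37:57.36 2016006 090E A TM Current 399' > file2

# Create the temporary file in the same directory as file1 so that mv is a
# rename on the same filesystem rather than a copy.
tmp=$(mktemp ./file1.XXXXXX) || exit 1

if awk 'NR==FNR{seen[$2,$3,$4]; next} !(($3,$4,$5) in seen)' file2 file1 > "$tmp"; then
    mv "$tmp" file1      # replace file1 only after awk succeeded
else
    rm -f "$tmp"         # leave file1 untouched on failure
fi

cat file1
```

Note that the replaced file1 takes on the permissions of the temporary file, which `mktemp` typically creates readable and writable by the owner only, so adjust with `chmod` if that matters. This does not protect against another process writing to file1 between the read and the `mv`.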
Ed Morton answered Oct 23 '22 04:10