
Delete line if line matches "foo", line above matches "bar" and line below matches "baz"?

Tags: sed, awk

Using sed and/or awk, I'd like to be able to delete a line only if it contains the string "foo" AND the lines before and after contain the strings "bar" and "baz" respectively.

So for this input:

blah
blah
foo
blah
bar
foo
baz
blah

we would delete the second foo but nothing else, leaving:

blah
blah
foo
blah
bar
baz
blah

I've tried using a while loop to read the file line by line, but this is slow and I can't work out how to match the previous and next lines.

Edit - as requested in a comment, this is the current state of my while loop. It currently only matches the previous line (stored from the previous iteration as $linepre).

linepre=0 
while read line
do 
   if [ $line != foo ] && [ $linepre != bar ]
   then 
       echo $line
   fi
linepre=$line
done < foobarbaz.txt

Pretty ugly.

birac asked Dec 11 '22 07:12


1 Answer

For an elegant perl solution see Sundeep's answer.

For a similar and very nice sed solution, see potong's second answer.

Both solutions read the whole file into memory and process it in one go. This is fine as long as you don't need to process files of several GB. Apart from that limitation, these are the best solutions (if we ignore CASE3).

comment: both solutions fail CASE3 (see below). CASE3 is an exceptional, debatable case.


Update 1: the following awk solution is a new script which works in all cases. The earlier solutions, for which this answer got accepted, failed on particular cases. The solution presented here also handles the nested grouping (CASE3 below):

awk 'BEGIN{p=1;l1=l2=""}
     (NR>2) && p {print l1}
     { p=!(l1~/bar/&&l2~/foo/&&/baz/);
       l1=l2;l2=$0
     }
     END{if (l1!="" && p) print l1
         if (l2!=""     ) print l2}' <file>

To solve the problem, we keep a rolling buffer of three lines, stored in l1, l2 and $0. Each time a new line is read, we decide whether l2 (which becomes l1) should be printed in the next cycle, and then shift the buffered lines. Printing only starts from NR=3 onward. If l1 contains bar, l2 contains foo and $0 contains baz, the foo line is not printed in the next cycle.
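As a quick check, the script above can be wrapped in a shell function and run on the sample input from the question. This is only a sketch: the function name delete_foo_between is made up for illustration, and the awk body is the script from Update 1 with comments added.

```shell
# delete_foo_between is a hypothetical helper name, not part of the original answer.
delete_foo_between() {
  awk 'BEGIN { p = 1; l1 = l2 = "" }
       # print the oldest buffered line unless it was flagged for deletion
       NR > 2 && p { print l1 }
       # flag l2 (the next l1) when it is a foo line between bar and baz
       { p = !(l1 ~ /bar/ && l2 ~ /foo/ && /baz/)
         l1 = l2; l2 = $0 }
       # flush the two lines still held in the buffer
       END { if (l1 != "" && p) print l1
             if (l2 != "") print l2 }' "$@"
}

# Sample input from the question: only the foo between bar and baz is dropped.
printf '%s\n' blah blah foo blah bar foo baz blah | delete_foo_between
```

With no arguments the function reads standard input; a file name can be passed instead, e.g. delete_foo_between foobarbaz.txt.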

Update 2: A sed solution based on the same principle is also possible. sed has two buffers: the pattern space, on which all operations are performed, and the hold space, which acts as long-term memory. The idea is to put the word print in the hold space, but we can only do this by swapping the two spaces (using x).

 sed '1{x;s/^.*$/print/;x;N};                           #1
      N;                                                #2
      x;/print/{z;x;P;x};x;                             #3
      /bar.*\n.*foo.*\n.*baz/!{x;s/^.*$/print/;x};      #4
      $s/\(bar.*\)\n.*foo.*\n\(.*baz\)/\1\n\2/;         #5
      D' <file>                                         #6
  • line #1 initializes the state by placing the word print in the hold space (x;s...;x) and appends another line to the pattern space (N)
  • line #2 adds the third line to the pattern space
  • line #3 checks the hold space to decide whether the first line of the pattern space must be printed. If the hold space contains print, z zaps (empties) it while the spaces are swapped, and after swapping back P prints the pattern space up to the first \n
  • line #4 determines if we should print in the next cycle: it checks whether the real pattern matches and, if not, puts the word print in the hold space
  • line #5 handles the end-of-file condition: on the last line, a matching foo line is removed from the pattern space directly
  • line #6 deletes up to the first \n in the pattern space and restarts the script from the top without reading a new line. Since the 1{...} block on line #1 no longer matches, processing effectively resumes at #2.

At exit, the pattern-space is printed again.

comment: if you want to see what the pattern space and hold space look like, you can add the following code after each line: s/^/P:/;l;s/^P://;x;s/^/H:/;l;s/^H://;x. This prints both spaces, prefixed with P: and H: respectively.
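To get a feel for the two spaces before reading the full script, here is a minimal sketch that uses the same debugging snippet on a two-line input. The function name show_spaces is made up for illustration; the script copies the first line to the hold space (h), appends the second line to the pattern space (N), and then dumps both spaces with l.

```shell
# show_spaces is a hypothetical helper name used only for this demo.
show_spaces() {
  # -n suppresses automatic printing, so only the l commands produce output
  sed -n '1{h;N;}
          s/^/P:/;l;s/^P://
          x;s/^/H:/;l;s/^H://;x'
}

printf 'bar\nfoo\n' | show_spaces
# prints:
#   P:bar\nfoo$
#   H:bar$
```

The l command shows an embedded newline as \n and marks the end of the buffer with $, which makes it easy to see that the pattern space holds two lines while the hold space still holds only the first.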

Test file used:

# bar-foo-baz test file
# An asterisk indicates the foo
# lines that should be removed
<CASE0 :: default case>
bar
foo (*)
baz
<CASE1 :: reset cycle on second line>
bar
foobar
foo (*)
baz
<CASE2 :: start cycle at end of previous cycle>
bar
foo (*)
bazbar
foo (*)
baz
<CASE3 :: nested cases>
bar
foobar (*)
foobaz (*)
baz
<CASE4 :: end-of-file case>
bar
foo

Formerly accepted answer: (updated to indicate which cases fail)

awk: fails CASE3

awk '!/baz/&&(c==2){print foo}
     /bar/         {c=1;print;next}
     /foo/ &&(c==1){c++;foo=$0;next}
                   {c=0;print}
     END{if(c==2){print foo}}' <file>

This solution prints all lines by default, except if the line contains foo which comes after a line containing bar. The logic above just decides if we should print the line foo or not.

  • !/baz/&&(c==2){print foo} : this solves early termination. If no baz is found after a valid bar-foo combination, it prints the foo line.

  • /bar/{c=1;print;next} : this initialises the start of a new cycle. If bar is found, set c to 1, print the line and move to the next line. bar lines are always printed. This line resolves CASE1 and CASE2.

  • /foo/&&(c==1){c++;foo=$0;next} : this checks the bar-foo combination. It stores the foo line and moves to the next line.

  • {c=0;print} : if we reach this point, we did not find a bar line or a bar-foo combination. Just print the line by default and reset the counter to zero.

  • END{if(c==2){print foo}} : this statement just solves CASE4.
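To see why this version fails CASE3, it can be run on just those four lines. This is a sketch only: the function name old_awk is made up for illustration, and the awk body is the formerly accepted script unchanged.

```shell
# old_awk is a hypothetical helper name wrapping the formerly accepted script.
old_awk() {
  awk '!/baz/&&(c==2){print foo}
       /bar/         {c=1;print;next}
       /foo/ &&(c==1){c++;foo=$0;next}
                     {c=0;print}
       END{if(c==2){print foo}}' "$@"
}

# CASE3 input from the test file
printf '%s\n' bar foobar foobaz baz | old_awk
# prints:
#   bar
#   foobar
#   baz
```

Because foobar also matches /bar/, the second rule prints it and restarts the cycle, so only foobaz is dropped. The test file marks both foobar and foobaz for removal, which is what the Update 1 script produces.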

gawk: fails CASE3

awk 'BEGIN{ORS="";RS="bar[^\n]*\n[^\n]*foo[^\n]*\n[^\n]*baz"}
     {sub(/\n[^\n]*foo[^\n]*\n/,"\n",RT); print $0 RT}' <file>

The RS is set to bar[^\n]*\n[^\n]*foo[^\n]*\n[^\n]*baz, i.e. the pattern we are interested in. Here, [^\n]*\n[^\n]* represents a string containing a single \n, so the RS represents a valid bar-foo-baz combination. The record separator that was actually matched, RT, is edited with sub to remove the foo line and printed after the record.

RT (gawk extension): The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.

sed: fails CASE1, CASE2, CASE3, CASE4

sed -n '/bar/{N;/\n.*foo/{N;/foo.*\n.*baz[^\n]*$/{s/\n.*foo.*\n/\n/}}};p' <file>
  • /bar/{N;...} if the line contains bar, append the next line to the pattern buffer (N)
  • /\n.*foo/{N;...} if the pattern buffer has foo after a newline character, append the next line to the pattern buffer (N)
  • /foo.*\n.*baz[^\n]*$/{s/\n.*foo.*\n/\n/} if the pattern buffer contains foo followed by a single newline and ends with a line containing baz, remove the line containing foo. The search pattern here excludes cases such as barfoo\nfoobaz\ncar
kvantour answered Dec 12 '22 20:12