Using sed and/or awk, I'd like to be able to delete a line only if it contains the string "foo" AND the lines before and after contain the strings "bar" and "baz" respectively.
So for this input:
blah
blah
foo
blah
bar
foo
baz
blah
we would delete the second foo but nothing else, leaving:
blah
blah
foo
blah
bar
baz
blah
I've tried using a while loop to read the file line by line, but this is slow and I can't work out how to match the previous and next lines.
Edit - as requested in a comment, this is the current state of my while loop. Currently only matches the previous line (stored from the previous loop as $linepre).
linepre=0
while read line
do
    if [ $line != foo ] && [ $linepre != bar ]
    then
        echo $line
    fi
    linepre=$line
done < foobarbaz.txt
Pretty ugly.
For an elegant perl solution, see Sundeep's answer.
For a similar and very nice sed solution, see potong's second answer.
Both solutions read the file completely into memory and process it in one go. This is fine as long as you don't need to process files that are gigabytes in size. In other words, these are the best solutions (if we ignore CASE3).
Comment: both solutions fail CASE3 (see below). CASE3 is an exceptional, debatable case.
Update 1: the following awk solution is a new script which works in all cases. The earlier solutions, for which this answer got accepted, failed on particular cases. The presented solution also handles the nested grouping (CASE3 below):
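awk 'BEGIN { p = 1; l1 = l2 = "" }
# print the line read two cycles ago, unless it was flagged for deletion
(NR > 2) && p { print l1 }
# decide whether the middle line (l2) may be printed, then shift the buffer
{ p = !(l1 ~ /bar/ && l2 ~ /foo/ && /baz/)
  l1 = l2; l2 = $0
}
# flush the last two buffered lines (testing NR keeps trailing empty lines)
END { if (NR > 1 && p) print l1
      if (NR > 0) print l2 }' <file>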
To solve the problem, we constantly buffer three lines, stored in l1, l2 and $0. Each time a new line is read, we decide whether the line that becomes l1 after the shift (the current l2) is printed in the next cycle, and then shift the buffered lines. Printing starts only from NR=3 onward. The rule is: if l1 contains bar, l2 contains foo and $0 contains baz, then we do not print in the next cycle.
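As a quick check of the three-line buffer described above, here is a minimal sketch that runs the same logic on the sample input from the question. The function name delete_foo_between is made up for this illustration, and the END block tests NR instead of comparing the buffers with the empty string, which is an assumption of this sketch rather than part of the original script.

```shell
#!/bin/sh
# Hypothetical helper name, used only for this illustration.
# It keeps a three-line window (l1, l2, $0) and prints one line late.
delete_foo_between() {
  awk '
    BEGIN { p = 1 }
    # print the line read two cycles ago, unless it was flagged
    NR > 2 && p { print l1 }
    {
      # p == 0 means: the current l2 is a foo line between bar and baz
      p = !(l1 ~ /bar/ && l2 ~ /foo/ && /baz/)
      l1 = l2; l2 = $0
    }
    # flush the two lines that are still buffered at end of input
    END { if (NR > 1 && p) print l1; if (NR > 0) print l2 }
  ' "$@"
}

# Sample input from the question: only the second foo should disappear.
printf '%s\n' blah blah foo blah bar foo baz blah | delete_foo_between
```

With no file argument the function reads standard input, so it can also be used at the end of a pipeline.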
Update 2: A sed solution based on the same principle can be obtained. sed has two memories: the pattern space, where all operations take place, and the hold space, which acts as long-term memory. The idea is to put the word print in the hold space, but we can only do this by swapping the two spaces around (using x).
sed '1{x;s/^.*$/print/;x;N}; #1
N; #2
x;/print/{z;x;P;x};x; #3
/bar.*\n.*foo.*\n.*baz/!{x;s/^.*$/print/;x}; #4
$s/\(bar.*\)\n.*foo.*\n\(.*baz\)/\1\n\2/; #5
D' <file> #6
#1 initializes the state by placing the word print in the hold space (x;s...;x) and appends another line to the pattern space (N).
#2 adds the third line to the pattern space.
#3 determines whether we need to print the first line of the pattern space by checking the hold space, and then clears the hold space. P prints up to the first \n in the pattern space and z zaps the pattern space (z is a GNU sed extension).
#4 determines whether we should print in the next cycle: it checks whether the real pattern matches and, if not, puts the word print in the hold space.
#5 is the end-of-file condition.
#6 deletes up to the first \n in the pattern space and goes back to #1 without reading a new line. At exit, the pattern space is printed again.
Comment: if you want to see what the pattern space and hold space look like, you can add the following code after each line: s/^/P:/;l;s/^P://;x;s/^/H:/;l;s/^H://;x. This prints both spaces, prefixed with P: and H: respectively.
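To see the hold-space flag at work, the following sketch feeds CASE2 from the test file through the same sed script. It assumes GNU sed (both the z command and the way N prints the pattern space at end of input are GNU behaviours), and the function name sed_delete_foo is made up for this illustration.

```shell
#!/bin/sh
# Assumes GNU sed. The function name is hypothetical.
sed_delete_foo() {
  sed '1{x;s/^.*$/print/;x;N}
N
x;/print/{z;x;P;x};x
/bar.*\n.*foo.*\n.*baz/!{x;s/^.*$/print/;x}
$s/\(bar.*\)\n.*foo.*\n\(.*baz\)/\1\n\2/
D' "$@"
}

# CASE2: the bazbar line ends one cycle and starts the next,
# so both foo lines should be removed.
printf '%s\n' bar foo bazbar foo baz | sed_delete_foo
```

On this input, the lines that remain are bar, bazbar and baz.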
Used test file:
# bar-foo-baz test file
# An asterisk indicates the foo
# lines that should be removed
<CASE0 :: default case>
bar
foo (*)
baz
<CASE1 :: reset cycle on second line>
bar
foobar
foo (*)
baz
<CASE2 :: start cycle at end of previous cycle>
bar
foo (*)
bazbar
foo (*)
baz
<CASE3 :: nested cases>
bar
foobar (*)
foobaz (*)
baz
<CASE4 :: end-of-file case>
bar
foo
Formerly accepted answer: (updated to indicate which cases fail)
awk: fails CASE3
awk '!/baz/&&(c==2){print foo}
/bar/ {c=1;print;next}
/foo/ &&(c==1){c++;foo=$0;next}
{c=0;print}
END{if(c==2){print foo}}' <file>
This solution prints all lines by default, except for a line containing foo that comes directly after a line containing bar. The logic above just decides whether that foo line should be printed or not.
!/baz/&&(c==2){print foo}: this solves early termination. If no baz is found after a valid bar-foo combination, it prints the stored foo line.
/bar/{c=1;print;next}: this initialises the start of a new cycle. If bar is found, set c to 1, print the line and move to the next line. bar lines are always printed. This line resolves CASE1 and CASE2.
/foo/&&(c==1){c++;foo=$0;next}: this checks for the bar-foo combination. It stores the foo line and moves to the next line.
{c=0;print}: if we reach this point, we did not find a bar line or a bar-foo combination. Just print the line by default and reset the counter to zero.
END{if(c==2){print foo}}: this statement just solves CASE4.
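The rules above can be tried out with a small sketch that runs the same script on CASE4 and CASE3 from the test file. The wrapper name old_delete_foo is made up for this illustration, and the script is only re-spaced for readability.

```shell
#!/bin/sh
# Hypothetical wrapper around the formerly accepted awk script.
old_delete_foo() {
  awk '!/baz/ && (c == 2) { print foo }
       /bar/              { c = 1; print; next }
       /foo/ && (c == 1)  { c++; foo = $0; next }
                          { c = 0; print }
       END                { if (c == 2) print foo }' "$@"
}

# CASE4: the input ends after bar-foo, so the END rule prints the stored foo.
printf '%s\n' bar foo | old_delete_foo

# CASE3: foobar matches /bar/, so it is printed as the start of a new cycle
# and is not removed; only foobaz is dropped.
printf '%s\n' bar foobar foobaz baz | old_delete_foo
```

The second run leaves bar, foobar and baz, which is why this version is marked as failing CASE3.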
gawk: fails CASE3
awk 'BEGIN{ORS="";RS="bar[^\n]*\n[^\n]*foo[^\n]*\n[^\n]*baz"}
{sub(/\n[^\n]*foo[^\n]*\n/,"\n",RT); print $0 RT}' <file>
The RS is set to bar[^\n]*\n[^\n]*foo[^\n]*\n[^\n]*baz, i.e. the pattern we are interested in. Here, [^\n]*\n[^\n]* represents a string containing a single \n, so the RS represents a valid bar-foo-baz combination. The matched record separator RT is edited with sub to remove the foo line and is printed after the record it terminated.
RT (gawk extension): The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
sed: fails CASE1, CASE2, CASE3, CASE4
sed -n '/bar/{N;/\n.*foo/{N;/foo.*\n.*baz[^\n]*$/{s/\n.*foo.*\n/\n/}}};p' <file>
/bar/{N;...}: if the line contains bar, append the next line to the pattern buffer (N).
/\n.*foo/{N;...}: if the pattern buffer has foo after a newline character, append the next line to the pattern buffer (N).
/foo.*\n.*baz[^\n]*$/{s/\n.*foo.*\n/\n/}: if the pattern buffer contains foo followed by a single newline and ends with a line containing baz, remove the line containing foo. The search pattern here excludes cases such as barfoo\nfoobaz\ncar.