Can I use sed
if I need to extract a pattern enclosed by a specific pattern, if it exists in a line?
Suppose I have a file with the following lines :
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the
/*
answer*/
but wish we didn’t.
In both the cases I have to scan the line for the first occurring pattern i.e ' [/
' or '/*
' in their respective cases and store the following pattern till then exit pattern i.e ' /
] 'or ' */
' respectively .
In short , I need fear
and answer
.If possible , Can it be extended for multiple lines ;in the sense ,if the exit pattern occurs in a line different than the same .
Any kind of help in the form of suggestions or algorithms are welcome. Thanks in advance for the replies
use strict;
use warnings;
while (<DATA>) {
while (m#/(\*?)(.*?)\1/#g) {
print "$2\n";
}
}
__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
As a one-liner:
perl -nlwe 'while (m#/(\*?)(.*?)\1/#g) { print $2 }' input.txt
The inner while loop will iterate between all matches with the /g
modifier. The backreference \1
will make sure we only match identical open/close tags.
If you need to match blocks that extend over multiple lines, you need to slurp the input:
use strict;
use warnings;
$/ = undef;
while (<DATA>) {
while (m#/(\*?)(.*?)\1/#sg) {
print "$2\n";
}
}
__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say. /* foofer */
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
foo bar /
baz
baaz / fooz
One-liner:
perl -0777 -nlwe 'while (m#/(\*?)(.*?)\1/#sg) { print $2 }' input.txt
The -0777
switch and $/ = undef
will cause file slurping, meaning all of the file is read into a scalar. I also added the /s
modifier to allow the wildcard .
to match newlines.
Explanation for the regex: m#/(\*?)(.*?)\1/#sg
m# # a simple m//, but with # as delimiter instead of slash
/(\*?) # slash followed by optional *
(.*?) # shortest possible string of wildcard characters
\1/ # backref to optional *, followed by slash
#sg # s modifier to make . match \n, and g modifier
The "magic" here is that the backreference requires a star *
only when one is found before it.
Quick and dirty way in awk
awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' input_file
$ cat file
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn't.
$ awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' file
fear
answer
If you really want to do this in sed, you can extract your delimited patterns relatively easily as long as they are on the same line.
# Using GNU sed. Escape a whole lot more if your sed doesn't handle
# the -r flag.
sed -rn 's![^*/]*(/\*?.*/).*!\1!p' /tmp/foo
If you want to perform multi-line matches with sed, things get a little uglier. However, it can certainly be done.
# Multi-line matching of delimiters with GNU sed.
sed -rn ':loop
/\/[^\/]/ {
N
s![^*/]+(/\*?.*\*?/).*!\1!p
T loop
}' /tmp/foo
The trick is to look for a starting delimiter, then keep appending lines in a loop until you find the ending delimiter.
This works really well as long as you really do have an ending delimiter. Otherwise, the contents of the file will keep being appended to the pattern space until sed finds one, or until it reaches the end of the file. This may cause problems with certain versions of sed or with really, really large files where the size of the pattern space gets out of hand.
See GNU sed's Limitations and Non-limitations for more information.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With