
Extract a specific pattern from lines with sed, awk or perl

Tags: grep, sed, awk, perl, nawk

Can I use sed to extract text enclosed by a specific pair of delimiters, if that pair exists in a line?

Suppose I have a file with the following lines:

There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

In both cases I have to scan the line for the first opening pattern, i.e. '[/' or '/*' respectively, and store the text that follows it up to the closing pattern, i.e. '/]' or '*/' respectively.

In short, I need fear and answer. If possible, can this be extended to multiple lines, in the sense that the closing pattern occurs on a different line than the opening one?

Any kind of help in the form of suggestions or algorithms is welcome. Thanks in advance for the replies.

Gil asked Jun 19 '12 14:06


3 Answers

use strict;
use warnings;

while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#g) {
        print "$2\n";
    }
}


__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

As a one-liner:

perl -nlwe 'while (m#/(\*?)(.*?)\1/#g) { print $2 }' input.txt

The inner while loop iterates over all matches in the line, thanks to the /g modifier. The backreference \1 makes sure we only match identical open/close tags. Note that for the /* ... */ form, $2 includes the spaces inside the delimiters.

If you need to match blocks that extend over multiple lines, you need to slurp the input:

use strict;
use warnings;

$/ = undef;
while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#sg) {
        print "$2\n";
    }
}

__DATA__
    There are many who dare not kill themselves for [/fear/] of what the neighbors will say. /* foofer */ 
    Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
foo bar /
baz 
baaz / fooz

One-liner:

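perl -l -0777 -nwe 'while (m#/(\*?)(.*?)\1/#sg) { print $2 }' input.txt

The -0777 switch and $/ = undef will cause file slurping, meaning the whole file is read into a single scalar. Note that -l comes before -0777 here: -l sets the output record separator from the current value of $/, so if it followed -0777 that value would be undef and print would no longer append a newline. I also added the /s modifier to allow the wildcard . to match newlines.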

Explanation for the regex: m#/(\*?)(.*?)\1/#sg

m#              # a simple m//, but with # as delimiter instead of slash
    /(\*?)      # slash followed by optional *
        (.*?)   # shortest possible string of wildcard characters
    \1/         # backref to optional *, followed by slash
#sg             # s modifier to make . match \n, and g modifier to find all matches

The "magic" here is that the backreference requires a star * only when one is found before it.

TLP answered Nov 17 '22 00:11


Quick and dirty way in awk (this needs GNU awk, since gensub is a gawk extension):

awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' input_file

Test:

$ cat file
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn't.
$ awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' file
fear

answer
jaypal singh answered Nov 16 '22 23:11


Single-Line Matches

If you really want to do this in sed, you can extract your delimited patterns relatively easily as long as they are on the same line.

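# Using GNU sed. Escape a whole lot more if your sed doesn't handle
# the -r flag. Note that the delimiters are kept in the output: for the
# two sample lines in the question this prints /fear/ and /* answer */.
sed -rn 's![^*/]*(/\*?.*/).*!\1!p' /tmp/foo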

Multi-Line Matches

If you want to perform multi-line matches with sed, things get a little uglier. However, it can certainly be done.

# Multi-line matching of delimiters with GNU sed.
sed -rn ':loop
         /\/[^\/]/ { 
             N
             s![^*/]+(/\*?.*\*?/).*!\1!p
             T loop
         }' /tmp/foo

The trick is to look for a starting delimiter, then keep appending lines in a loop until you find the ending delimiter.

This works well as long as an ending delimiter actually exists. Otherwise, the contents of the file keep being appended to the pattern space until sed finds one or reaches the end of the file. This may cause problems with certain versions of sed, or with very large files where the size of the pattern space gets out of hand.

See GNU sed's Limitations and Non-limitations for more information.
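The append-until-closed idea can also be written as a stripped-down loop. This is only a sketch under stated assumptions, not the script above: it assumes GNU sed, handles only the /* ... */ style, expects at most one block per collected span, and stops quietly if the last line is reached without a closing delimiter. The function name extract_comment and the sample text are made up for illustration.

```shell
# Hypothetical sample: the closing "*/" sits on the second line.
sample='know the /* multi
line answer */ but wish'

extract_comment() {
    sed -n '
        # A line with an opening delimiter starts the collection.
        /\/\*/ {
            :loop
            # No closing delimiter yet: give up on the last line,
            # otherwise append the next line and check again.
            /\*\//! {
                $ q
                N
                b loop
            }
            # Keep only the text between the delimiters.
            s!.*/\*\(.*\)\*/.*!\1!p
        }
    '
}

printf '%s\n' "$sample" | extract_comment
```

The explicit check for the last line keeps the loop from running past the end of the input when the block is never closed, which is the failure mode described above.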

Todd A. Jacobs answered Nov 16 '22 23:11