
Efficient non-greedy method of returning multiple lines between patterns

Tags: bash, sed, awk

I have a file like this:

bar 1
 foo 1
  how now
  manchu 50
 foo 2
  brown cow
  manchu 55
 foo 3
  the quick brown
  manchu 1
bar 2
 foo 1
  fox jumped
  manchu 8
 foo 2
  over the
  manchu 20
 foo 3
  lazy dog
  manchu 100
 foo 4
  manchu 5
 foo 5
  manchu 7
bar 3
bar 4

I want to search 'manchu 55' and receive:

FOONUMBER=2

(The foo # above 'manchu 55')

BARNUMBER=1

(The bar # above that foo)

PHRASETEXT="brown cow"

(The text on the line above 'manchu 55')

So I can ultimately output:

brown cow, bar 1, foo 2.

Thus far I've accomplished this with some really ugly grep code like:

FOONUMBER=`grep -e "manchu 55" -e ^" foo" -e ^"bar" | grep -B 1 "manchu 55" | grep "foo" | awk '{print $2}'`

BARNUMBER=`grep -e ^" foo $FOONUMBER" -e ^"bar" | grep -B 1 "foo $FOONUMBER" | grep "bar" | awk '{print $2}'`

PHRASETEXT=`grep -B 1 "manchu 55" | grep -v "manchu 55"`

There are 3 problems with this code:

  • It makes me cringe because I know it's bad.
  • It's slow; I have to go through hundreds of thousands of entries and it's taking too long.
  • Sometimes, as in bar 2, foo 4 and 5 in my example, there is no text above the 'manchu'. In this case, it incorrectly returns a foo line as the phrase text, which is not what I want.

I suspected I could do this with sed, doing something like:

FOONUMBER=`sed -n '/foo/,/manchu 55/p' | grep foo | awk '{print $2}'`

Unfortunately sed is too greedy. I've been reading about AWK and state machines, which seem like they might be a better way to do this, but I still don't understand them well enough to set it up.

As you may have gathered by now, programming is not what I do for a living; it has been thrust upon me. I'm hoping to rewrite what I already have so that it is more efficient and not too complicated, since some other poor sod without a programming degree will probably have to support any changes to it at some future date.

asked Feb 11 '23 00:02 by Eleck


2 Answers

With awk:

awk -v nManchu=55 -v OFS=", " '
  $1 == "bar" {bar = $0}    # store the most recently seen "bar" line
  $1 == "foo" {foo = $0}    # store the most recently seen "foo" line 
  $1 == "manchu" && $2 == nManchu {print prev, bar, foo}    # on a match, print phrase, bar, foo
  {prev = $0}               # remember the previous line
' file

outputs

  brown cow, bar 1,  foo 2

Running with "nManchu=100" outputs

  lazy dog, bar 2,  foo 3

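This has the advantage of taking only a single pass through the file, instead of parsing the file 3 times to get the "bar", "foo" and previous lines. Because whole lines ($0) are stored, the indentation of the phrase and foo lines is carried into the output.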
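The question also mentions entries such as bar 2, foo 4, where no phrase line sits above the manchu line; there, remembering the previous line would report the foo line as the phrase. The following is a minimal sketch of one way to handle that case in the same single pass, not a definitive implementation. It assumes that phrase lines never begin with the words "bar", "foo" or "manchu". The helper name lookup_manchu and the temporary sample file are hypothetical, introduced only for illustration.

```shell
# Hypothetical sample data: a shortened copy of the file from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
bar 1
 foo 1
  how now
  manchu 50
 foo 2
  brown cow
  manchu 55
bar 2
 foo 3
  lazy dog
  manchu 100
 foo 4
  manchu 5
EOF

# lookup_manchu NUMBER FILE   (hypothetical helper name)
# Prints "phrase, bar N, foo M" for each matching manchu line.
lookup_manchu() {
  awk -v nManchu="$1" -v OFS=", " '
    $1 == "bar" { bar = $2; foo = ""; phrase = ""; next }  # new bar block
    $1 == "foo" { foo = $2; phrase = ""; next }            # new foo block, no phrase yet
    $1 == "manchu" {
      if ($2 == nManchu) print phrase, "bar " bar, "foo " foo
      phrase = ""
      next
    }
    { sub(/^[ \t]+/, ""); phrase = $0 }   # any other line is phrase text
  ' "$2"
}

with_phrase=$(lookup_manchu 55 "$sample")
without_phrase=$(lookup_manchu 5 "$sample")
printf '%s\n' "$with_phrase" "$without_phrase"
rm -f "$sample"
```

With this sample, searching for 55 yields the phrase "brown cow", while searching for 5 yields an empty phrase field rather than a foo line. Storing only $2 for bar and foo, and stripping leading whitespace from the phrase, also keeps the indentation out of the output.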

answered Feb 23 '23 13:02 by glenn jackman


I would suggest

sed -n '/foo/ { s/.*foo\s*//; h }; /manchu 55/ { x; p }' filename

This is very simple:

/foo/ {         # if you see a line with "foo" in it,
  s/.*foo\s*//  # isolate the number
  h             # and put it in the hold buffer
}
/manchu 55/ {   # if you see a line with "manchu 55" in it,
  x             # exchange hold buffer and pattern space
  p             # and print the pattern space.
}

This prints the number from the last foo line seen before the manchu 55 line. Note that \s is a GNU sed extension; with other sed implementations, [[:space:]] can be used instead. The bar number can be extracted essentially the same way, and for the phrase text you could use

 sed -n '/manchu 55/ { x; p }; h' filename

to print the line that was held just before manchu 55 is seen. Or possibly

 sed -n '/manchu 55/ { x; p }; s/^\s*//; h' filename

to strip leading whitespace from that line.

If you are certain that only one manchu 55 line exists in the file or you only want the first match, you can replace x; p with x; p; q. The q will then quit directly after the result is printed.
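As a rough sketch of how these pieces might fit together, the commands below fill the three variables from the question using the hold-buffer approach with q. This is an illustration under assumptions, not the answer's exact code: it uses [[:space:]] instead of \s and a ; before each } for portability beyond GNU sed, it anchors the pattern as /manchu 55$/ so that a line such as "manchu 550" would not match, and the temporary sample file is hypothetical.

```shell
# Hypothetical sample data: a shortened copy of the file from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
bar 1
 foo 1
  how now
  manchu 50
 foo 2
  brown cow
  manchu 55
bar 2
 foo 1
  fox jumped
  manchu 8
EOF

# Hold the number of the most recent bar line; at the first "manchu 55" line,
# swap it into the pattern space, print it, and quit.
BARNUMBER=$(sed -n '/^bar/ { s/^bar[[:space:]]*//; h; }; /manchu 55$/ { x; p; q; }' "$sample")

# Same idea for the most recent foo line.
FOONUMBER=$(sed -n '/^[[:space:]]*foo/ { s/.*foo[[:space:]]*//; h; }; /manchu 55$/ { x; p; q; }' "$sample")

# Hold every line (minus leading whitespace); print the one held just before the match.
PHRASETEXT=$(sed -n '/manchu 55$/ { x; p; q; }; s/^[[:space:]]*//; h' "$sample")

echo "$PHRASETEXT, bar $BARNUMBER, foo $FOONUMBER."
rm -f "$sample"
```

This still reads the file once per variable, but each sed process stops at the first match instead of scanning to the end.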

answered Feb 23 '23 12:02 by Wintermute