
Parse HTML snippet with awk

Tags: bash, awk

I am trying to parse an HTML document with awk.

The document contains several <div class="p_header_bottom"></div> blocks:

 <div class="p_header_bottom">
    <span class="fl_r"></span>
    287,489 people
  </div>
  <div class="p_header_bottom">
    <span class="fl_r"></span>
    5 links
  </div>

I am using

awk '/<div class="p_header_bottom">/,/<\/div>/'

to extract all such divs.

How can I get the number 287,489 from the first one?

I tried awk '/<\/span>/,/people/', but it doesn't work correctly.

zavg asked Nov 18 '25 11:11


1 Answer

With gawk, and assuming that the only digits and commas within each <div> </div> block occur in the numeric portion of interest:

awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print}' file.txt
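Treating a multi-character RS as a regular expression is not guaranteed by POSIX, so the one-liner above may behave differently outside gawk. As a rough sketch of a line-oriented variant that avoids regex RS, the following tracks whether the current line is inside a p_header_bottom block and prints the digits and commas from the first line containing "people". The file name sample.html and the variable in_div are hypothetical, and the sketch assumes each number sits on its own line, as in the question's sample.

```shell
# Sample input copied from the question (sample.html is a hypothetical name)
cat > sample.html <<'EOF'
<div class="p_header_bottom">
  <span class="fl_r"></span>
  287,489 people
</div>
<div class="p_header_bottom">
  <span class="fl_r"></span>
  5 links
</div>
EOF

result=$(awk '
  /<div class="p_header_bottom">/ { in_div = 1; next }  # entering a block
  /<\/div>/                       { in_div = 0 }         # leaving a block
  in_div && /people/ {
    gsub(/[^0-9,]/, "")   # keep only digits and commas on this line
    print
    exit                  # stop after the first match
  }
' sample.html)

echo "$result"
```

Dropping the exit would print the number from every block whose text contains "people" rather than only the first.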
iruvar answered Nov 21 '25 01:11


