
How to extract div tag

Tags: parsing, perl

I'm trying to parse an HTML file and I want to extract everything inside an outer div tag with a unique id. Sample:

<body>
  ...
  <div id="1">

    <div id="2">
    ...
    </div>

    <div id="3">
    ...
    </div>

  </div>
  ...
</body>

Here I want to extract everything between <div id="1"> and its matching closing </div> tag, NOT the first </div> tag.

I've gone through many older posts, but their solutions don't work because they stop at the first </div> tag, which is not what I'm looking for.

Any pointer would be appreciated.

asked Nov 20 '25 01:11 by gameover


2 Answers

It sounds like your problem is that you are trying to parse HTML using regular expressions.

Don't. Use an HTML parser. There are plenty on CPAN. I'm fond of HTML::TreeBuilder::XPath.
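
As a rough illustration of the parser approach, here is a minimal sketch using HTML::TreeBuilder::XPath. The sample markup and the variable names ($html, $tree, $outer) are made up for the example, and it assumes the module is installed from CPAN. The XPath expression selects the div by its id attribute, and as_HTML serializes that element together with everything nested inside it, so the inner </div> tags do not cut the result short.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input modeled on the question.
my $html = <<'END';
<body>
  <div id="1">
    Under div id 1
    <div id="2">Under div id 2</div>
    <div id="3">Under div id 3</div>
  </div>
  <p>Outside the divs</p>
</body>
END

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select the outer div by its id attribute.
my ($outer) = $tree->findnodes('//div[@id="1"]');
die "No div with id 1 found\n" unless $outer;

# as_HTML serializes the element together with all of its descendants.
my $markup = $outer->as_HTML;
print $markup, "\n";

# as_text gives the text of the element and its descendants, without tags.
my $text = $outer->as_text;
print $text, "\n";

# Free the parse tree once it is no longer needed.
$tree->delete;
```

If only the children are wanted, without the outer div itself, calling content_list on the node and serializing each child should work along the same lines.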

answered Nov 21 '25 22:11 by Quentin


Quentin has rightly mentioned using an HTML parser to extract div content. Here's one option using Mojo::DOM:

use strict;
use warnings;
use Mojo::DOM;

my $text = <<END;
<body>
  ...
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>

    <div id="3">
Under div id 3
    </div>

  </div>
Outside the divs
</body>
END

my $dom = Mojo::DOM->new($text);

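# pluck() is gone from current Mojolicious releases; map('text') is its replacement.
# Current releases also no longer trim whitespace in text(), so the output may carry extra blank lines.
print $dom->find('div[id="1"]')->map('text')->join("\n");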

Output:

Under div id 1
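
Note that text only returns the text sitting directly in the selected div, not the nested divs. If the goal is everything inside <div id="1">, a sketch along these lines may be closer to what the question asks. It assumes a reasonably recent Mojolicious where content and all_text are available on Mojo::DOM; the sample markup and variable names are illustrative only.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input modeled on the question.
my $dom = Mojo::DOM->new(<<'END');
<body>
  <div id="1">
    Under div id 1
    <div id="2">Under div id 2</div>
    <div id="3">Under div id 3</div>
  </div>
  <p>Outside the divs</p>
</body>
END

# at() returns the first element matching the selector, or undef if none.
my $outer = $dom->at('div[id="1"]');
die "No div with id 1 found\n" unless $outer;

# content() returns the inner markup, nested divs included.
my $inner = $outer->content;
print $inner, "\n";

# all_text() returns the text of the element and all of its descendants.
my $all_text = $outer->all_text;
print $all_text, "\n";
```

Because the parser tracks nesting, the match ends at the closing tag that belongs to the outer div rather than at the first </div> it encounters.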

answered Nov 22 '25 00:11 by Kenosis


