I'm trying to parse an HTML file and I want to extract everything inside an outer div tag with a unique id. Sample:
<body>
...
<div id="1">
<div id="2">
...
</div>
<div id="3">
...
</div>
</div>
...
</body>
Here I want to extract everything between <div id="1"> and its corresponding closing </div> tag, NOT the first </div> tag.
I've gone through many older posts, but their solutions don't work because they stop at the first </div> tag, which is not what I'm looking for.
Any pointers would be appreciated.
It sounds like your problem is that you are trying to parse HTML using regular expressions.
Don't. Use an HTML parser. There are plenty on CPAN. I'm fond of HTML::TreeBuilder::XPath.
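As a rough sketch of what that could look like with HTML::TreeBuilder::XPath (the sample markup and the variable names below are made up for illustration, and the exact formatting of the serialized HTML depends on the module's defaults):

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input modelled on the question.
my $html = <<'END';
<body>
<div id="1">
Under div id 1
<div id="2">
Under div id 2
</div>
<div id="3">
Under div id 3
</div>
</div>
Outside the divs
</body>
END

# Build a parse tree from the HTML string.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findnodes returns every node matching the XPath expression; keep the first.
my ($outer) = $tree->findnodes('//div[@id="1"]');
die "No div with id 1 found\n" unless $outer;

# as_HTML serializes the element together with its nested divs;
# as_text returns the text of the element and all of its descendants.
my $outer_html = $outer->as_HTML;
my $outer_text = $outer->as_text;

print $outer_html, "\n";

# Free the tree explicitly once it is no longer needed.
$tree->delete;
```

Because the parser tracks nesting, the match ends at the closing tag that belongs to div 1, not at the first </div> it encounters.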
Quentin has rightly mentioned using an HTML parser to extract div content. Here's one option using Mojo::DOM:
use strict;
use warnings;
use Mojo::DOM;
my $text = <<END;
<body>
...
<div id="1">
Under div id 1
<div id="2">
Under div id 2
</div>
<div id="3">
Under div id 3
</div>
</div>
Outside the divs
</body>
END
my $dom = Mojo::DOM->new($text);
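# pluck was removed from Mojo::Collection in later Mojolicious releases; map is its replacement.
print $dom->find('div[id="1"]')->map('text')->join("\n");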
Output:
Under div id 1
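Note that text only returns the text directly inside the matched div, not the nested divs. If the goal is everything inside div 1, a minimal sketch along these lines may be closer to what the question asks for. It assumes a reasonably recent Mojolicious in which Mojo::DOM provides at, content and all_text; the variable names are illustrative only:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input modelled on the question.
my $html = <<'END';
<body>
<div id="1">
Under div id 1
<div id="2">
Under div id 2
</div>
<div id="3">
Under div id 3
</div>
</div>
Outside the divs
</body>
END

my $dom = Mojo::DOM->new($html);

# at returns the first element matching the selector, or undef if none does.
my $outer = $dom->at('div[id="1"]');
die "No div with id 1 found\n" unless defined $outer;

# content is the markup inside the element, nested divs included.
my $inner_html = $outer->content;

# all_text is the text of the element and all of its descendants.
my $all_text = $outer->all_text;

print $inner_html;
```

Here $inner_html holds the markup between <div id="1"> and its matching </div>, while $all_text holds only the text content with the tags stripped.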