How to extract div tag

Question

I'm trying to parse an HTML file and I want to extract everything inside an outer div tag with a unique id. Sample:

<body>
  ...
  <div id="1">

    <div id="2">
    ...
    </div>

    <div id="3">
    ...
    </div>

  </div>
  ...
</body>

Here I want to extract everything between <div id="1"> and its corresponding closing </div> tag, NOT the first </div> tag that follows it.

I've gone through many older posts, but their solutions don't work because they stop at the first </div> tag, which is not what I'm looking for.

Any pointers would be appreciated.

Quentin · Accepted Answer

It sounds like your problem is that you are trying to parse HTML using regular expressions.

Don't. Use an HTML parser. There are plenty on CPAN. I'm fond of HTML::TreeBuilder::XPath.
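
As a rough illustration of the parser approach, here is a minimal sketch with HTML::TreeBuilder::XPath. The helper name div_markup_by_id and the sample markup are made up for this example, and it assumes the module is installed from CPAN. Because the parser builds a tree, the nested divs come along with the outer one and there is no closing tag to match by hand.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the markup of the div with the given id
# (including its nested children), or undef if no such div exists.
sub div_markup_by_id {
    my ($html, $id) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);

    # In list context findnodes returns the matching elements.
    my ($div) = $tree->findnodes(qq{//div[\@id="$id"]});
    my $markup = defined $div ? $div->as_HTML : undef;

    $tree->delete;    # release the parse tree
    return $markup;
}

# Sample input, invented for this sketch.
my $html = <<'END';
<body>
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>
    <div id="3">
Under div id 3
    </div>
  </div>
Outside the divs
</body>
END

print div_markup_by_id($html, 1), "\n";
```

Note that as_HTML serializes the outer div element itself along with its children; calling as_text on the same node would give only the text.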

Kenosis · Answer

Quentin has rightly mentioned using an HTML parser to extract div content. Here's one option using Mojo::DOM:

use strict;
use warnings;
use Mojo::DOM;

my $text = <<END;
<body>
  ...
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>

    <div id="3">
Under div id 3
    </div>

  </div>
Outside the divs
</body>
END

my $dom = Mojo::DOM->new($text);

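# pluck() and collection stringification were dropped from Mojo::Collection
# in later Mojolicious releases; map('text') with an explicit join is the
# equivalent there.
print $dom->find('div[id=1]')->map('text')->join("\n");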

Output:

Under div id 1
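
The text method only covers text sitting directly under the matched div. If the goal is everything inside the outer div, nested divs included, something along these lines may be closer. It is a sketch assuming a reasonably recent Mojolicious where Mojo::DOM provides at and content. The helper name inner_html_by_id and the sample markup are invented for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the inner HTML of the element with the
# given id, or undef if no element matches.
sub inner_html_by_id {
    my ($html, $id) = @_;

    my $dom = Mojo::DOM->new($html);

    # at() returns the first matching element or undef.
    my $el = $dom->at(qq{[id="$id"]});
    return defined $el ? $el->content : undef;
}

# Sample input, invented for this sketch.
my $html = <<'END';
<body>
  <div id="1">
Under div id 1
    <div id="2">
Under div id 2
    </div>
    <div id="3">
Under div id 3
    </div>
  </div>
Outside the divs
</body>
END

print inner_html_by_id($html, 1), "\n";
```

If only the text is wanted rather than the markup, calling all_text on the matched element instead of content collects the text of all descendants as well.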

Tags:

parsing

perl

gameover
