
How can I remove every third HTML tag in Perl?

Tags: html, regex, perl

This is a quickly cooked up script, but I am having some difficulty due to unfamiliarity with regexes and Perl.

The script is supposed to read in an HTML file. One section of the file contains nothing but a run of <div>s, arranged in groups of four. I want to remove the third <div> of every group.

My script below won't compile, let alone run.

#!/usr/bin/perl
use warnings;
use strict;


&remove();

sub remove {
    my $input = $ARGV[0];
    my $output = $ARGV[1];
    open INPUT, $input or die "couldn't open file $input: $!\n";
    open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";

    my @file = <INPUT>;
    foreach (@file) {
        my $int = 0;
        if ($_ =~ '<div class="cell">') {
        $int++;
        { // this brace was the wrong way
        if ($int % 4 == 3) {
        $_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
            }
    }
    print OUTPUT @file;
}

Thanks for all your help. I know it is wrong to parse with a regex, but I just want this one to work.

Postmortem: The problem is almost solved. And shame on those who told me that a regex is not good -- I knew that to begin with. But I wanted something fast. I had written the XSLT that produced this HTML, but in this case I didn't have the source to run it again; otherwise I would have programmed the change into the XSLT.

Overflown asked Dec 18 '22 08:12


1 Answer

I agree that HTML can't really be parsed by regexes, but for quick little hacks on HTML that you know the format of, regexes work great. The trick to doing repetition replacements with a regex is to put the repetition into the regex. If you don't do that you'll run into trouble syncing the position of the regex matcher with the input you're reading.

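Here's the quick-and-dirty way I'd write the Perl. It removes the third div element even when it is nested within the first two divs. The whole file is read into one string, and the "g" (global) modifier makes the regex do the counting: each match covers the first two cells, the third cell (which is dropped), and everything up to the opening tag of the fourth, so the next match resumes just past that tag and counts through the following group of four. The "s" modifier lets "." match newlines, so a match can span lines. If you haven't seen the "x" modifier before, all it does is let you add whitespace for formatting; that whitespace is ignored in the regex.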

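#!/usr/bin/perl
use strict;
use warnings;

remove(@ARGV);

sub remove {
  my ($input, $output) = @_;

  # Lexical filehandles with three-argument open.
  open(my $in, "<", $input) or die "couldn't open file $input: $!\n";
  open(my $out, ">", $output) or die "couldn't open file $output: $!\n";

  # Read the whole file into one string so the regex can match across lines.
  my $content = join("", <$in>);
  close($in);

  # $1 = the first two cells, middle line = the third cell (dropped),
  # $2 = everything up to and including the fourth cell's opening tag.
  $content =~ s|(.*? <div \s+ class="cell"> .*? <div \s+ class="cell"> .*?)
                <div \s+ class="cell"> .*? </div>
                (.*? <div \s+ class="cell">)|$1$2|sxg;

  print $out $content;
  close($out);
}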
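
To see what the substitution does without touching any files, here is a minimal sketch that applies the same regex to an in-memory string. The helper name drop_third_cell and the sample markup (eight cells labelled a through h) are made up for illustration, and it assumes each cell is a flat <div class="cell">...</div> with no nested divs inside it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same substitution as in remove(), but on a string.
sub drop_third_cell {
  my ($html) = @_;
  $html =~ s|(.*? <div \s+ class="cell"> .*? <div \s+ class="cell"> .*?)
             <div \s+ class="cell"> .*? </div>
             (.*? <div \s+ class="cell">)|$1$2|sxg;
  return $html;
}

# Made-up sample input: two groups of four cells, one cell per line.
my $sample = join "", map { qq{<div class="cell">$_</div>\n} } 'a' .. 'h';

# Print the result so the surviving cells can be inspected.
print drop_third_cell($sample);
```

With this input, the cells labelled c and g (the third of each group) should be gone. The newline that followed each removed div is not part of the dropped text, so a blank line is left where the cell used to be. Because the pattern requires the opening tag of a fourth cell, a trailing group with fewer than four cells is left untouched.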
Ken Fox answered Jan 04 '23 00:01