Convert CSS Style Attributes to HTML Attributes using Perl

Question

Quick background: we have a PDF maker (HTMLDoc) that converts HTML into a PDF. HTMLDoc doesn't consistently pick up the styles we need from the HTML that the client provides. So I'm trying to convert things such as style="width:80px;height:90px;" to width=80 height=90.

My attempt so far has revealed my limited understanding of backreferences and how to use them properly in Perl regexes. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that CSS.

I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!

NOTE: The only properties I'm trying to change in this particular script are "height", "width" and "border", because our client uses a tool that automatically applies styles to elements that they drag around in a WYSIWYG-style editor. Using a regex to strip these styles out works fairly well in a lot of places, since you just let the table cells be sized by their content, which looks okay. But I figured a quicker way to deal with the issue would be to replace those three properties with "width", "height" and "border" attributes, which behave mostly the same as their CSS counterparts. The exception is that CSS lets you customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step forward.

Here's what I've got so far:

#!/usr/bin/perl
if (!@ARGV[0] || !@ARGV[1])
{
  print "Usage: converter.pl [input file] [output file]\n";
  exit;
}
open FILE, "<", @ARGV[0] or die $!;
open OUTFILE, ">", @ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
  $line = $_ ;
  $line =~ /style=\"(.+)\"/;
  $guts = $1;
  $guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
  $name = $1;
  $value = $2;
  $guts = $name."=".$value;
  $line =~ s/style=\"(.+)\"/$guts/g;
  print OUTFILE $line ;
}

exit;

Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.

UPDATE

For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...

#!/usr/bin/perl

## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values. 
# All of these can be modified in the second foreach loop, which checks for height and width. 

if (!@ARGV[0] || !@ARGV[1])
{
  print "Usage: quickvert.pl [input file] [output file]\n";
  exit;
}
open FILE, "<", @ARGV[0] or die $!;
open OUTFILE, ">", @ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
  $line = $_ ;
  my (@match) = $line =~ /style=\"(.+?)\"/g;
  my $guts;
  my $newguts;
  foreach (@match) {
    #print $_ ."\n";
    $guts = $_;
    $guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
    $newguts = "";
    foreach my $style (split(/;/,$guts)) {
      my ($name, $value) = split(/:/,$style);
      $value =~ s/px//g;
      if ( $name =~ m/height/g || $name =~ m/width/g ) {
      $newguts .= "$name='$value' ";
      } else {
      $newguts .= "";
      }
    }
    #print "replacing $guts with $newguts on line $count\n";
  # \Q...\E quotes $guts so characters like ( ) . are matched literally
  $line =~ s/style=\"\Q$guts\E\"/$newguts/i;
  }

  #print $newguts;



  print OUTFILE $line ;
  $count++;
}

exit;

Adam Bellaire · Accepted Answer

You will have a very difficult time with this, for a few reasons:

Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...

This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.

EDIT: Given your very specific feature set, I can give you a little advice on your implementation:

You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:

$line =~ /style=\"(.+?)\"/i;

That way you only match up to the very next double quote, not everything on the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:

next unless ($line =~ /style=\"(.+?)\"/i);
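To see the difference concretely, here is a minimal sketch comparing the greedy and non-greedy forms. The sample line and the variable names ($greedy, $lazy, @all) are made up for illustration, not taken from your data:

```perl
use strict;
use warnings;

# Hypothetical sample line containing two style attributes.
my $line = '<td STYLE="width:80px;"><span style="height:90px;">x</span></td>';

# Greedy: .+ runs on to the LAST double quote on the line.
my ($greedy) = $line =~ /style="(.+)"/i;

# Non-greedy: .+? stops at the NEXT double quote.
my ($lazy) = $line =~ /style="(.+?)"/i;

# With /g in list context, every style attribute on the line is captured.
my @all = $line =~ /style="(.+?)"/gi;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    @all\n";
```

The /i flag is what lets the upper-case STYLE match, and the /g form is one way to pick up more than one style attribute per line.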

For parsing the guts, I'd use split instead of regex:

my $newguts = '';
foreach my $style (split(/;/,$guts)) {
    my ($name, $value) = split(/:/,$style);
    $newguts .= "$name='$value' ";
}
# \Q...\E quotes $guts so regex metacharacters in it are matched literally
$line =~ s/style=\"\Q$guts\E\"/$newguts/i;
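Putting those pieces together, a rough sketch of the whole conversion could look like the following. The helper names style_to_attrs and convert_line and the %keep list are my own inventions for illustration. It assumes double-quoted style attributes, keeps only width and height, and strips a trailing "px"; anything else in the style is silently dropped:

```perl
use strict;
use warnings;

# Properties to carry over as attributes (assumption: only these two matter).
my %keep = map { $_ => 1 } qw(width height);

# Turn the contents of one style attribute into a string of HTML attributes.
sub style_to_attrs {
    my ($guts) = @_;
    my @attrs;
    for my $decl (split /;/, $guts) {
        my ($name, $value) = split /:/, $decl, 2;
        next unless defined $value;          # skip blank or malformed pieces
        s/^\s+|\s+$//g for $name, $value;    # trim surrounding whitespace
        $name = lc $name;
        next unless $keep{$name};
        $value =~ s/px$//i;                  # 80px -> 80
        push @attrs, qq{$name="$value"};
    }
    return join ' ', @attrs;
}

# Replace every style attribute on a line, not just the first one.
sub convert_line {
    my ($line) = @_;
    $line =~ s{\s+style="(.*?)"}{
        my $attrs = style_to_attrs($1);
        length $attrs ? " $attrs" : "";
    }gie;
    return $line;
}

print convert_line(
    qq{<td style="width:80px;height:90px;color:red">a</td><td style="width: 40px">b</td>\n}
);
```

Using s///ge with a callback avoids interpolating $guts back into a second regex, and handles any number of style attributes per line. It still treats HTML as text, so single-quoted attributes or styles spanning several lines would be missed.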

Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.
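For what it's worth, here is a small sketch of what the named-match style looks like under strict and warnings. It needs Perl 5.10 or later, and parse_px_decl is a made-up helper name that only accepts pixel values:

```perl
use strict;
use warnings;

# Parse one declaration such as "width:80px" into (name, number).
# Returns an empty list when the text is not a plain pixel declaration.
sub parse_px_decl {
    my ($decl) = @_;
    if ($decl =~ /^\s*(?<name>[a-z-]+)\s*:\s*(?<value>\d+)px\s*$/i) {
        # %+ holds the named captures of the last successful match.
        return ($+{name}, $+{value});
    }
    return;
}

my ($name, $value) = parse_px_decl('width:80px');
print qq{$name="$value"\n} if defined $name;
```

The names make it obvious which capture is which, and they keep working if you later add another group to the pattern.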

draegtun · Answer

Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.

Below is a quick example using HTML::TreeBuilder that adds a border="1" attribute to any tag whose style attribute mentions border:

use strict;
use warnings;
use feature 'say';   # say() is not enabled by default
use HTML::TreeBuilder;

my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};

my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );

for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
    my $prop = $style->attr( 'style' );
    $style->attr( 'border', 1 ) if $prop =~ m/border/;
}

say $tree->as_HTML;

Which will reproduce the HTML but with border="1" added just to the span tag.

Alongside these modules, you can also have a look at CSS and CSS::DOM to help parse the CSS itself.

Tags: html, css, perl

NateDSaint