Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trouble with parsing table data in perl

I have a long htdoc of similar pattern which goes on like this:

<td class="MODULE_PRODUCTS_CELL " align="center" valign="top" height="100">
<table width="100" summary="products"><tr>
<td align="center" height="75">
<a href="/collections.php?prod_id=50">
<img src="files/products_categories50_t.txt" border="0" alt="products" /></a><\br>
</td>
</tr>
<tr>
<td align="center">
<a href="/collections.php?prod_id=50"><strong>Buffer</strong><br />
</a>
<td>
</tr></table>
</td>

In the above html I want to extract:

  1. collections.php?prod_id=50
  2. files/products_categories50_t.txt
  3. Buffer

I have tried this code to begin with,

#!/usr/local/bin/perl

use strict;
use warnings;
my $filename =  'sr.txt';

open(FILENAME,$filename);
my @str = <FILENAME>;
chomp(@str);
#print "@str";

foreach my  $str(@str){    
     if ($str =~/<td class(.*)<a href(.*?)><\/td>/) {
         print "*****$2\n";
     }    
}

This code is a trial one. However it brings only last occurrence and not each occurrence. Why?

like image 439
user524707 Avatar asked Nov 26 '10 09:11

user524707


2 Answers

SUMMARY

Using patterns on little, limited pieces of reasonably well-defined pieces of HTML is quick and easy. But using them on an entire document containing fully general, open-ended HTML of unforeseeable quirks is, while theoretically possible, in practice much too hard compared with using someone else’s parser that’s already been written for that express purpose. See also this answer for a more general discussion on using patterns on XML or HTML.

Naïve Regex Solution

You’ve asked for a regex solution, so I will provide you such.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

while (m{ < \s* img [^>]* src \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "IMG SRC=$1\n";
}

while (m{ < \s* a [^>]* href \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "A HREF=$1\n";
}

while (m{ < \s* strong [^>]* > (.*?) < \s* / \s* strong \s* > }gsix) {
    print "STRONG=$1\n";
}

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

That program, when run, produces this output:

IMG SRC=files/products_categories50_t.txt
A HREF=/collections.php?prod_id=50
A HREF=/collections.php?prod_id=50
STRONG=Buffer

If you are quite certain that works for the particular specimen of HTML that you wish it to, then by all means use it. Notice several things that I do which you didn’t. One of them is not dealing with the HTML a line at a time. That virtually never works.

However, this sort solutions works only on extremely limited forms of valid HTML. You can only use it when you can guarantee that the HTML you’re working with really looks like what you expect it to.

The problem is that it quite often does not look all neat and tidy. For these situations, you are strongly advised to use an HTML parsing class. However, no one seems to have shown you the code to do that. That’s not very helpful.

Wizard-Level Regex Solution

And I’m going to be one of them myself. Because I am going to show you a more general solution for approaching what I believe your take to be, but unlike anyone else who ever posts on Stack Overflow, I’m going to use regexes to do it, just to show you that it can be done, but that you do not wish to do it this way:

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

our(
    $RX_SUBS,
    $tag_template_rx,
    $script_tag_rx,
    $style_tag_rx,
    $strong_tag_rx,
    $a_tag_rx,
    $img_tag_rx,
);

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ $style_tag_rx  .*?  < (?&WS) / (?&WS) style  (?&WS) > }{}gsix; 
s{ $script_tag_rx .*?  < (?&WS) / (?&WS) script (?&WS) > }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

while (/$img_tag_rx/g) {
    my $tag = $+{TAG};
    printf "IMG tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
            \b src (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tSRC is $value\n";
    } 

} 

while (/$a_tag_rx/g) {
    my $tag = $+{TAG};
    printf "A tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
            \b href (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tHREF is $value\n";
    } 
} 

while (m{
            $strong_tag_rx  (?&WS) 
            (?<BODY> .*? )  (?&WS) 
            < (?&WS) / (?&WS) strong (?&WS) > 
        }gsix) 
{
    my ($tag, $body) = @+{ qw< TAG BODY > };
    printf "STRONG tag at %d: %s\n\tBODY=%s\n", 
            pos(), $+{TAG}, $+{BODY};
} 

exit;

sub dequote { 
    my $string = shift();
    $string =~ s{
        ^
        (?<quote>   ["']      )
        (?<BODY> 
            (?: (?! \k<quote> ) . ) *
        )
        \k<quote> 
        $
    }{$+{BODY}}gsx;
    return $string;
}

sub load_patterns { 

    $RX_SUBS = qr{ (?(DEFINE)

        (?<any_attribute> 
            \b \w+
            (?&WS) = (?&WS) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<WS>     \s *   )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

      ) # end DEFINE

    }six;

    my $_TAG_SUBS = $RX_SUBS . q{ (?(DEFINE)

        (?<attributes>
            (?: 
                (?&WS) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            (?= (?&legal_attribute) )
            (?&any_attribute) 
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<tag>
            (?&start_tag)
            (?&WS) 
            (?&attributes) 
            (?&WS) 
            (?&end_tag)
        )

      ) # end DEFINE

    };  # this is a q tag, not a qr

    $tag_template_rx = qr{ 

            $_TAG_SUBS

        (?<TAG> (?&XXX_tag) )

        (?(DEFINE)
            (?<XXX_tag>     (?&tag)             )
            (?<start_tag>  < (?&WS) XXX \b      )
            (?<required_attribute>      (*FAIL) )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE
    }six;

    $script_tag_rx = qr{   

            $_TAG_SUBS

        (?<TAG> (?&script_tag) )
        (?(DEFINE)
            (?<script_tag>  (?&tag)                )
            (?<start_tag>  < (?&WS) style \b       )
            (?<required_attribute>      type )
            (?<permitted_attribute>             
                charset     
              | defer
              | src
              | xml:space
            )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end DEFINE
    }six;

    $style_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&style_tag) )

        (?(DEFINE)

            (?<style_tag>  (?&tag)  )

            (?<start_tag>  < (?&WS) style \b       )

            (?<required_attribute>      type    )
            (?<permitted_attribute>     media   )

            (?<standard_attribute>
                dir
              | lang
              | title
              | xml:lang
            )

            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        )  # end define

    }six;

    $strong_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&strong_tag) )

        (?(DEFINE)

            (?<strong_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                strong 
                \b       
            )

            (?<standard_attribute>
                class       
              | dir 
              | ltr 
              | id  
              | lang        
              | style       
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on click    
                on dbl click        
                on mouse down       
                on mouse move       
                on mouse out        
                on mouse over       
                on mouse up 
                on key down 
                on key press        
                on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<optional_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE

    }six; 

    $a_tag_rx = qr{         

            $_TAG_SUBS

        (?<TAG> (?&a_tag) )

        (?(DEFINE)
            (?<a_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                a 
                \b       
            )

            (?<permitted_attribute>
                charset     
              | coords      
              | href        
              | href lang   
              | name        
              | rel 
              | rev 
              | shape       
              | rect
              | circle
              | poly        
              | target
            )

            (?<standard_attribute>
                access key  
              | class       
              | dir 
              | ltr 
              | id
              | lang        
              | style       
              | tab index   
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on blur     
              | on click    
              | on dbl click        
              | on focus    
              | on mouse down       
              | on mouse move       
              | on mouse out        
              | on mouse over       
              | on mouse up 
              | on key down 
              | on key press        
                on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end define
    }xi;

    $img_tag_rx = qr{           
        $_TAG_SUBS
        (?<TAG> (?&image_tag) )
        (?(DEFINE)

            (?<image_tag> (?&tag) )

            (?<start_tag>  
                < (?&WS) 
                img 
                \b       
            )

            (?<required_attribute>
                alt
              | src
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                height
              | is map
              | long desc
              | use map
              | width
            )

            (?<deprecated_attribute>
                 align
               | border
               | hspace
               | vspace
            )

            (?<standard_attribute>
                class
              | dir
              | id
              | style
              | title
              | xml:lang
            )

            (?<event_attribute>
                on abort
              | on click
              | on dbl click
              | on mouse down
              | on mouse out
              | on key down
              | on key press
              | on key up
            )

        ###########################

        ) # end DEFINE

    }six;

}

UNITCHECK { load_patterns() } 

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

That program, when run, produces this output:

IMG tag at 304: <img src="files/products_categories50_t.txt" border="0" alt="products" />
        SRC is files/products_categories50_t.txt
A tag at 214: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
A tag at 451: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
STRONG tag at 491: <strong>
        BODY=Buffer

The Choice Is Yours — Or Is It?

Both those solve your problem with regexes. It is possible that you will be able to use the first of my two approaches. I cannot say, because like seemingly all such questions asked here, you haven’t told us enough about the data for us (and perhaps also you) to know for sure whether the naïve approach will suffice.

When it doesn’t, you have two choices.

  1. You can either use the more robust and flexible approach offered by my second technique. Just make certain that you understand it in all its aspects, because otherwise you won’t be able to maintain your code — and neither will anybody else.
  2. Use an HTML parsing class.

I find it unlikely that even 1 person in a 1000 would reasonably make the first of those two choices. In particular, I find it extremely unlikey that someone who asks for help with regexes as simple as those in my first solution would be a person capable of managing the regexes given in my second solution.

Which really leaves you with only one “choice” — if I may use that word so loosely.

like image 182
tchrist Avatar answered Nov 07 '22 15:11

tchrist


You may find that parsing this would be easier with XPath than regexes. Your data could do with being somewhat more semantically structured though, but I guess that might be out of your hands.

Have a look at XML::XPath.

The 10-Minute XPath Tutorial from Automating System Administration with Perl also might be handy.

like image 35
ptomli Avatar answered Nov 07 '22 14:11

ptomli