python - How to parse an HTML table

I have an HTML page with about 50 tables on it. Each table has the same layout but different values, e.g.:

<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value1</td>
<td width="25" height="25" class="usernum">value2</td>
<td width="25" height="25" class="usernum">value3</td>
<td width="25" height="25" class="usernum">value4</td>
<td width="25" height="25" class="usernum">value5</td>
<td width="25" height="25" class="usernum">value6</td>
<td width="25" height="25" class="totalnum">otherVal</td>
</tr>
</table>

My REST server runs Django/Python, so my urls.py routes to a parse_url() function, which is where I want to do all the work. My problem is that I'm pretty much a newbie when it comes to Python, so I don't know where to put my code. I took some code from the HTMLParser Python docs and changed it as follows:

import urllib, urllib2
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        HttpResponse("Encountered data %s" % data)


def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)
    return HttpResponse('DONE')

At the moment this code doesn't output anything useful; it just returns DONE.

How do I use the class methods such as handle_starttag()? Shouldn't these be called automatically when I use p.feed(content)?

Ultimately, when I go to mysite.com/showlist, I want to output a list like this:

value1
value2
value3
value4
value5
value6

otherVal

This needs to be done in a loop, because there are roughly 50 tables with different values in each table.

Thanks for helping a beginner!!

eoinzy asked May 13 '26 22:05


2 Answers

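Your handlers are called by p.feed(content), but handle_starttag() and handle_endtag() print to stdout rather than to the Django response, and the HttpResponse built in handle_data() is thrown away. Here is how to get HTMLParser to do your bidding: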

# Python 2 code (HTMLParser and urllib.urlopen were moved in Python 3).
import urllib
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)
        self.capture_data = False  # True while inside a <td>
        self.data_list = []        # text of every <td> seen so far

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.capture_data = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.capture_data = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a <td>.
        if self.capture_data and data and not data.isspace():
            self.data_list.append(data)

def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)  # calls the handle_* methods as the markup is parsed
    # Return the collected values in the response instead of printing them.
    return HttpResponse(str(p.data_list))

I would recommend putting the class into a utils.py file in the same folder as your views.py and importing it from there. That keeps views.py manageable, since it will contain only views.
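
For anyone on Python 3, where the parser lives in html.parser, the same idea might look like the sketch below. It also groups the cells per table and separates the usernum cells from the totalnum cell, which is closer to the output the question asks for. The names TableCellParser, format_tables and SAMPLE_HTML are made up for illustration, and the sample markup is the question's table trimmed to three cells.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect <td> text per <table>, split by the cell's class attribute."""

    def __init__(self):
        super().__init__()
        self.tables = []          # one dict per <table> seen so far
        self._cell_class = None   # class of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append({"usernum": [], "totalnum": []})
        elif tag == "td" and self.tables:
            # attrs is a list of (name, value) pairs.
            self._cell_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell_class = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._cell_class in ("usernum", "totalnum"):
            self.tables[-1][self._cell_class].append(text)


def format_tables(tables):
    """Render each table as its values, a blank line, then its total(s)."""
    blocks = []
    for table in tables:
        lines = table["usernum"] + [""] + table["totalnum"]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Sample markup based on the question (hypothetical, trimmed to three cells).
SAMPLE_HTML = """
<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value1</td>
<td width="25" height="25" class="usernum">value2</td>
<td width="25" height="25" class="totalnum">otherVal</td>
</tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_HTML)
print(format_tables(parser.tables))
```

In a Django view, the string from format_tables() could be returned with HttpResponse(text, content_type='text/plain'). When fetching the real page on Python 3, note that urllib.request.urlopen(url).read() returns bytes, which need decoding to str before being passed to feed().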

Gringo Suave answered May 15 '26 12:05


Check out BeautifulSoup. The documentation is at http://www.crummy.com/software/BeautifulSoup/documentation.html.

PS: It will be much more flexible, including for future requirements!
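
As a rough illustration of that suggestion, here is a minimal sketch using the current bs4 package (installed with pip install beautifulsoup4) and Python's built-in html.parser backend. The function name extract_tables and the SAMPLE_HTML string are assumptions for the example; the class names come from the table in the question.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_tables(html):
    """Return one (values, total) pair per table with class customTableClass."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for table in soup.find_all("table", class_="customTableClass"):
        values = [td.get_text(strip=True)
                  for td in table.find_all("td", class_="usernum")]
        total_cell = table.find("td", class_="totalnum")
        total = total_cell.get_text(strip=True) if total_cell else None
        results.append((values, total))
    return results


# Sample markup based on the question (hypothetical, trimmed to three cells).
SAMPLE_HTML = """
<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value1</td>
<td width="25" height="25" class="usernum">value2</td>
<td width="25" height="25" class="totalnum">otherVal</td>
</tr>
</table>
"""

for values, total in extract_tables(SAMPLE_HTML):
    print("\n".join(values))
    print()
    print(total)
```

Selecting cells by class name keeps the loop over the roughly 50 tables short, and it keeps working if other cells are added to the tables later.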

Vishal answered May 15 '26 10:05


