How can I use the Python HTMLParser library to extract data from a specific div tag?

I am trying to extract a value from an HTML page using the Python HTMLParser library. The value I want is inside this HTML element:

...
<div id="remository">20</div>
...

This is my HTMLParser class so far:

import urllib
import HTMLParser

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.seen = {}

  def handle_starttag(self, tag, attributes):
    if tag != 'div': return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        #print value
        return

  def handle_data(self, data):
    print data

p = LinksParser()
f = urllib.urlopen("http://example.com/somepage.html")
html = f.read()
p.feed(html)
p.close()

I want the class to extract the value 20 from that div.

asked Jul 18 '10 15:07

Martin

3 Answers

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attributes):
    if tag != 'div':
      return
    if self.recording:
      self.recording += 1
      return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        break
    else:
      return
    self.recording = 1

  def handle_endtag(self, tag):
    if tag == 'div' and self.recording:
      self.recording -= 1

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

At the end of the parse, the data is left in self.data (a list of strings, empty if no triggering tag was met). Code outside the class can read the list directly from the instance, or you can add accessor methods, depending on your goal.
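
As a rough illustration of the counting and accumulation described above, here is a minimal sketch. It assumes Python 3, where the module is html.parser rather than the Python 2 HTMLParser module used in this thread. The class name DivTextParser and the sample string are made up for the example.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect text inside <div id="remository">, including nested divs."""

    def __init__(self):
        super().__init__()
        self.recording = 0  # depth of open divs, counted from the triggering one
        self.data = []      # text chunks seen while recording

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.recording:
            # A nested div inside the triggering one: go one level deeper.
            self.recording += 1
            return
        # attrs is a list of (name, value) tuples.
        if ("id", "remository") in attrs:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Hypothetical sample input with a nested div and an unrelated trailing div.
sample = '<p>skip</p><div id="remository">20<div> items</div></div><div>after</div>'
parser = DivTextParser()
parser.feed(sample)
parser.close()

# Read the accumulated list directly from the instance.
value = "".join(parser.data).strip()
print(value)
```

The text in the first p and in the trailing div is ignored because the counter is zero when it arrives. The nested div's text is kept because the counter is still positive at that point.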

The class could easily be made more general by replacing the literal strings 'div', 'id', and 'remository' with instance attributes self.tag, self.attname, and self.attvalue, set by __init__ from its arguments. I left that generalization out of the code above to avoid obscuring the core points: keep a count of nested tags, and accumulate data into a list while the recording state is active.
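
One possible shape for that generalization is sketched below, again assuming Python 3's html.parser. The attribute names self.tag, self.attname, and self.attvalue come from the paragraph above. The class name TagTextParser, the text() accessor, and the extract_text() helper are invented for illustration.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect text inside the first-level match of <tag attname="attvalue">."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        # The parser reports tag and attribute names in lowercase,
        # so pass lowercase names here.
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1
        elif (self.attname, self.attvalue) in attrs:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

    def text(self):
        # Hypothetical accessor: join the collected chunks into one string.
        return "".join(self.data).strip()


def extract_text(html, tag, attname, attvalue):
    """Hypothetical convenience wrapper: parse html and return the matched text."""
    parser = TagTextParser(tag, attname, attvalue)
    parser.feed(html)
    parser.close()
    return parser.text()


print(extract_text('<div id="remository">20</div>', "div", "id", "remository"))
print(extract_text('<span class="price">9.99</span>', "span", "class", "price"))
```

The attribute match is an exact comparison, so a class attribute holding several space-separated names would need extra handling.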

answered Oct 19 '22 09:10

Alex Martelli

Have you tried BeautifulSoup?

from bs4 import BeautifulSoup
soup = BeautifulSoup('<div id="remository">20</div>', 'html.parser')
tag = soup.find('div', id='remository')
print(tag.string)

This prints 20.

answered Oct 19 '22 10:10

modzello86

A small correction to line 3:

HTMLParser.HTMLParser.__init__(self)

If you import the class with from HTMLParser import HTMLParser, as in the code below, it should be

HTMLParser.__init__(self)

The following worked for me:

import urllib2

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

  def __init__(self):
    HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attrs):
    if tag == 'required_tag':
      for name, value in attrs:
        if name == 'somename' and value == 'somevalue':
          print name, value
          print "Encountered the beginning of a %s tag" % tag
          self.recording = 1

  def handle_endtag(self, tag):
    # Only decrement while recording, so the counter never goes negative
    if tag == 'required_tag' and self.recording:
      self.recording -= 1
      print "Encountered the end of a %s tag" % tag

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

p = MyHTMLParser()
f = urllib2.urlopen('http://www.example.com')
html = f.read()
p.feed(html)
p.close()
print p.data

answered Oct 19 '22 09:10

pshirishreddy

Tags: python, html, parsing, html-parsing