
Parsing HTML to get text inside an element

Tags: python, html, python-2.x, html-parser

I need to get the text inside the two elements into a string:

    source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

    >>> text
    'Martin Elias'

How could I achieve this?

323

asked Aug 03 '12 22:08

Martin Eliáš

1 Answer

I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html

This code is taken from the Python docs:

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data  :", data

    # instantiate the parser and feed it some HTML
    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')

Here is the result:

    Encountered a start tag: html
    Encountered a start tag: head
    Encountered a start tag: title
    Encountered some data  : Test
    Encountered an end tag : title
    Encountered an end tag : head
    Encountered a start tag: body
    Encountered a start tag: h1
    Encountered some data  : Parse me!
    Encountered an end tag : h1
    Encountered an end tag : body
    Encountered an end tag : html
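The event order in that output (start tag, data, end tag) is enough to pick out text from one particular element only. Here is a minimal sketch of that idea, not something from the docs: `ClassTextParser`, `wanted_class`, `depth` and `chunks` are names I made up for illustration, and the `try`/`except` import is only there so the same code runs on Python 2 and Python 3. It assumes well-formed markup where every start tag has a matching end tag, and it compares the `class` attribute as an exact string.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class ClassTextParser(HTMLParser):
    """Hypothetical helper: collects text nested inside elements
    whose class attribute equals wanted_class."""

    def __init__(self, wanted_class):
        HTMLParser.__init__(self)
        self.wanted_class = wanted_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.chunks = []  # text pieces found inside the matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside the target element: track nesting.
            self.depth += 1
        elif dict(attrs).get("class") == self.wanted_class:
            # Entering the target element.
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = ClassTextParser("UserName")
parser.feed('<p>ignored</p>'
            '<span class="UserName"><a href="#">Martin Elias</a></span>')
parser.close()
text = "".join(parser.chunks)
```

Text outside the matching `span` never reaches `chunks`, because `depth` is zero there. Void tags such as a bare `<br>` inside the target element would throw the depth counter off, so treat this as a starting point rather than a general solution.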

Using this, and by looking at the HTMLParser source, I came up with this:

    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)
        def handle_data(self, data):
            self.HTMLDATA.append(data)
        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

You can use it like this:

    from HTMLParser import HTMLParser

    pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""


    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)
        def handle_data(self, data):
            self.HTMLDATA.append(data)
        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

    parser = myhtmlparser()
    parser.feed(pstring)

    # Extract data from parser
    tags  = parser.NEWTAGS
    attrs = parser.NEWATTRS
    data  = parser.HTMLDATA

    # Clean the parser
    parser.clean()

    # Print out our data
    print tags
    print attrs
    print data

Now you should be able to extract your data from those lists easily. I hope this helped!
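To get the single string the question asks for, the collected data pieces can simply be joined. This is a rough sketch under my own assumptions: `TextCollector` and `extract_text` are hypothetical names, and the fallback import is only there so it also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    """Return all text in html as one string, with the tags dropped."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()


source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
text = extract_text(source_code)
```

With the `source_code` from the question, `text` ends up as `'Martin Elias'`. Note that this keeps the text of every element in the input, so it only fits when the whole string is the fragment you care about.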

114

answered Sep 20 '22 00:09

LISTERINE
