
Parsing HTML to get text inside an element

I need to get the text inside the two elements into a string:

    source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

    >>> text
    'Martin Elias'

How could I achieve this?

Martin Eliáš asked Aug 03 '12 22:08



1 Answer

I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html

This code is taken from the Python docs:

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data  :", data

    # instantiate the parser and feed it some HTML
    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')

Here is the result:

    Encountered a start tag: html
    Encountered a start tag: head
    Encountered a start tag: title
    Encountered some data  : Test
    Encountered an end tag : title
    Encountered an end tag : head
    Encountered a start tag: body
    Encountered a start tag: h1
    Encountered some data  : Parse me!
    Encountered an end tag : h1
    Encountered an end tag : body
    Encountered an end tag : html

Using this and by looking at the code in HTMLParser I came up with this:

    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)
        def handle_data(self, data):
            self.HTMLDATA.append(data)
        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

You can use it like this:

    from HTMLParser import HTMLParser

    pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)
        def handle_data(self, data):
            self.HTMLDATA.append(data)
        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

    parser = myhtmlparser()
    parser.feed(pstring)

    # Extract data from parser
    tags  = parser.NEWTAGS
    attrs = parser.NEWATTRS
    data  = parser.HTMLDATA

    # Clean the parser
    parser.clean()

    # Print out our data
    print tags
    print attrs
    print data

Now you should be able to extract your data from those lists easily. For the sample input, data is ['Martin Elias'], so data[0] is the string you are after. I hope this helped!
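
The code above targets Python 2 (the HTMLParser module and print statements). On Python 3 the same idea might look roughly like the sketch below. It assumes the standard library module html.parser; the names TextCollector and extract_text are made up for illustration and are not part of any library. It calls super().__init__() rather than reset() directly so the base class can set up its own state.

```python
from html.parser import HTMLParser  # Python 3 home of HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every piece of text between tags."""

    def __init__(self):
        super().__init__()  # let the base class initialise itself
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text it encounters.
        self.chunks.append(data)


def extract_text(markup):
    """Return all text found in markup as one stripped string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks).strip()


source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
text = extract_text(source_code)
print(text)  # Martin Elias
```

This joins every text node in the input, which is enough for the one-element snippet in the question. For a full page you would probably also track the current tag in handle_starttag and handle_endtag and collect text only while inside the element you care about.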

LISTERINE answered Sep 20 '22 00:09