I need to get the text inside the two elements into a string:
    source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
    >>> text
    'Martin Elias'
How could I achieve this?
If you just want to parse HTML in browser JavaScript, and your HTML is intended for the body of your document, you could do the following:

    var div = document.createElement("DIV");
    div.innerHTML = markup;
    result = div.childNodes;

This gives you a collection of child nodes and should work not just in IE8 but even in IE6-7.
The term parsing comes from Latin pars (orationis), meaning part (of speech). In your case, HTML parsing basically means taking in HTML code and extracting the relevant information, such as the title of the page, its paragraphs, headings, links, bold text, etc.
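To make "extracting relevant information" a bit more concrete, here is a minimal sketch that pulls the page title and link targets out of a small document. It assumes Python 3, where the standard-library parser lives in `html.parser`; the class name `PageSummary` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical helper: collects the page title and all link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs arrives as a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self._in_title:
            self.title += data


summary = PageSummary()
summary.feed(
    '<html><head><title>Test</title></head>'
    '<body><h1>Parse me!</h1><a href="#">Martin Elias</a></body></html>'
)
summary.close()

print(summary.title)
print(summary.links)
```

The same pattern (set a flag on the start tag, collect text in `handle_data`, clear the flag on the end tag) should carry over to headings, paragraphs, or any other element you care about.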
I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html
This code is taken from the Python docs:

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag

        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag

        def handle_data(self, data):
            print "Encountered some data :", data

    # instantiate the parser and feed it some HTML
    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')
Here is the result:
    Encountered a start tag: html
    Encountered a start tag: head
    Encountered a start tag: title
    Encountered some data : Test
    Encountered an end tag : title
    Encountered an end tag : head
    Encountered a start tag: body
    Encountered a start tag: h1
    Encountered some data : Parse me!
    Encountered an end tag : h1
    Encountered an end tag : body
    Encountered an end tag : html
Using this and by looking at the code in HTMLParser I came up with this:
    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)

        def handle_data(self, data):
            self.HTMLDATA.append(data)

        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
You can use it like this:
    from HTMLParser import HTMLParser  # Python 2 module name

    source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

        def handle_starttag(self, tag, attrs):
            # Record every opening tag together with its attributes
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)

        def handle_data(self, data):
            # Record the text found between tags
            self.HTMLDATA.append(data)

        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

    parser = myhtmlparser()
    parser.feed(source_code)

    # Extract data from parser
    tags = parser.NEWTAGS
    attrs = parser.NEWATTRS
    data = parser.HTMLDATA

    # Clean the parser
    parser.clean()

    # Print out our data
    print tags
    print attrs
    print data
Now you should be able to extract your data from those lists easily. I hope this helped!
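If you are on Python 3, the module is named `html.parser` and `print` is a function, so the code above needs small changes. Here is a rough sketch of how the question's snippet could be turned into a single string there. The class name `TextCollector` is just an illustration, and joining and stripping the text chunks is one possible choice, not the only one.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags."""

    def __init__(self):
        # In Python 3 the base initializer sets up state that feed() relies on
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

collector = TextCollector()
collector.feed(source_code)
collector.close()

# Join the collected pieces into one string
text = "".join(collector.chunks).strip()
print(text)  # Martin Elias
```

For markup with several text nodes you may prefer `" ".join(...)` or to keep the list as it is, depending on how you want whitespace handled.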