I need to get the text inside the two elements into a string:

    source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

    >>> text
    'Martin Elias'

How could I achieve this?
If you just want to parse HTML in the browser with JavaScript, and your HTML is intended for the body of your document, you could do the following:

    var div = document.createElement("DIV");
    div.innerHTML = markup;
    result = div.childNodes;

This gives you a collection of child nodes and should work not just in IE8 but even in IE6-7.
The term parsing comes from Latin pars (orationis), meaning part (of speech). In your case, HTML parsing is basically: taking in HTML code and extracting relevant information like the title of the page, paragraphs in the page, headings in the page, links, bold text etc.
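To make "extracting relevant information" concrete, here is a minimal sketch using Python 3's standard html.parser module. The class name PageSummaryParser, its attributes, and the sample markup are made up for illustration; it only looks at the title, h1-h3 headings, and link targets, and it assumes headings contain plain text with no nested tags.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Hypothetical helper: collects the title, headings, and link targets."""

    HEADING_TAGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in self.HEADING_TAGS:
            self.headings.append(data)


# Sample markup invented for this sketch
sample_html = (
    "<html><head><title>Test</title></head>"
    "<body><h1>Parse me!</h1><a href='https://example.com/'>a link</a></body></html>"
)

summary = PageSummaryParser()
summary.feed(sample_html)
summary.close()

print(summary.title)
print(summary.headings)
print(summary.links)
```

The parser is event-driven: it never builds a tree, it just calls the handler methods as it meets tags and text, so the subclass decides what counts as "relevant".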
I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html
This code is taken from the Python docs:

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data  :", data

    # instantiate the parser and feed it some HTML
    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')

Here is the result:

    Encountered a start tag: html
    Encountered a start tag: head
    Encountered a start tag: title
    Encountered some data  : Test
    Encountered an end tag : title
    Encountered an end tag : head
    Encountered a start tag: body
    Encountered a start tag: h1
    Encountered some data  : Parse me!
    Encountered an end tag : h1
    Encountered an end tag : body
    Encountered an end tag : html

Using this, and by looking at the code in HTMLParser, I came up with this:

    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)
        def handle_data(self, data):
            self.HTMLDATA.append(data)
        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

You can use it like this:

    from HTMLParser import HTMLParser

    pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

    class myhtmlparser(HTMLParser):
        def __init__(self):
            self.reset()
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []
        def handle_starttag(self, tag, attrs):
            self.NEWTAGS.append(tag)
            self.NEWATTRS.append(attrs)
        def handle_data(self, data):
            self.HTMLDATA.append(data)
        def clean(self):
            self.NEWTAGS = []
            self.NEWATTRS = []
            self.HTMLDATA = []

    parser = myhtmlparser()
    parser.feed(pstring)

    # Extract data from parser
    tags  = parser.NEWTAGS
    attrs = parser.NEWATTRS
    data  = parser.HTMLDATA

    # Clean the parser
    parser.clean()

    # Print out our data
    print tags
    print attrs
    print data

Now you should be able to extract your data from those lists easily. For the example string, data holds a single text node, so the string you want is data[0]. I hope this helped!
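The code above targets Python 2. In Python 3 the module is named html.parser and print is a function, so here is a hedged sketch of the same idea for Python 3, reduced to what the question asks for. The class name TextExtractor and its chunks attribute are made up for this example; it collects every text node, so it assumes the markup contains no text you want to skip.

```python
from html.parser import HTMLParser

source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects every text node the parser reports."""

    def __init__(self):
        # In Python 3 the base initialiser should run first;
        # calling self.reset() alone skips part of the parser's setup.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed(source_code)
extractor.close()

# Join the collected text nodes into one string
text = "".join(extractor.chunks).strip()
print(text)  # prints: Martin Elias
```

Joining the chunks rather than taking the first one keeps the result in one string even if the text is spread across several nested elements.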