I found HTMLParser for SAX-style parsing and xml.dom.minidom for XML. My HTML is fairly well formed, so I don't need a very robust parser. Any suggestions?

Is there a built-in package to parse HTML into a DOM?

2 Answers

I would recommend lxml. I like BeautifulSoup, but there are maintenance issues generally and compatibility issues with the later releases. I've been happy using lxml.

Later: the best recommendations are lxml, html5lib, or BeautifulSoup 3.0.8. BeautifulSoup 3.1.x is meant for Python 3.x and is known to have problems with earlier Python versions, as noted on the BeautifulSoup website.

Ian Bicking has a good article on using lxml.

ElementTree is a further recommendation, but I have never used it.
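For well-formed markup like the asker describes, the standard library's xml.etree.ElementTree can build a tree directly. This is only a minimal sketch: it assumes the input is valid XML (every tag closed, no HTML-only entities such as &nbsp;), and the sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative input; assumed to be well-formed XML, not arbitrary HTML.
html_string = (
    "<html><head><title>title</title></head>"
    "<body><p>test</p><p>more</p></body></html>"
)

root = ET.fromstring(html_string)

# findtext() takes a simple path relative to the root element.
title = root.findtext("head/title")

# iter() walks the whole tree and yields every element with the given tag.
paragraphs = [p.text for p in root.iter("p")]

print(title, paragraphs)
```

If the markup is not valid XML, ET.fromstring raises xml.etree.ElementTree.ParseError, so this only suits input you control or have already cleaned.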

2012-01-18: someone has come by and decided to downvote me and Bartosz because we recommended Python packages that are easily obtained but not part of the Python distribution. So for the highly literal StackOverflowers: "You can use xml.dom.minidom, but no one will recommend this over the alternatives."

149

answered Sep 20 '22 20:09

hughdbrown

BeautifulSoup and lxml are great, but they are not appropriate answers here, since the question is about built-in modules. Here is an example that uses the built-in minidom module to parse an HTML string. Tested with CPython 3.5.2:

    from xml.dom.minidom import parseString

    html_string = """
    <!DOCTYPE html>
    <html><head><title>title</title></head><body><p>test</p></body></html>
    """

    # extract the text value of the document's <p> tag:
    doc = parseString(html_string)
    paragraph = doc.getElementsByTagName("p")[0]
    content = paragraph.firstChild.data

    print(content)
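Since the question asks for a DOM, it may help to see that the object minidom returns can be walked like one. The following is a small sketch; the sample markup and the class attribute are invented for illustration.

```python
from xml.dom.minidom import parseString

# Illustrative, well-formed input (an assumption, not from the question).
doc = parseString(
    '<html><body><p class="intro">first</p><p>second</p></body></html>'
)

body = doc.getElementsByTagName("body")[0]

# childNodes holds every direct child; filter to element nodes only.
for node in body.childNodes:
    if node.nodeType == node.ELEMENT_NODE:
        # getAttribute() returns an empty string when the attribute is absent.
        print(node.tagName, repr(node.getAttribute("class")), node.firstChild.data)

# Collect the text of every <p> anywhere in the document.
texts = [p.firstChild.data for p in doc.getElementsByTagName("p")]
```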

However, as indicated in Jesse Hogan's comment, this will fail on HTML entities that minidom does not recognize. Here is an updated solution using the Python 3 html.parser module:

    from html.parser import HTMLParser

    html_string = """
    <!DOCTYPE html>
    <html><head><title>title</title></head><body><p>&nbsp;test</p><div>not in p</div></body></html>
    """

    class Parser(HTMLParser):
        def __init__(self):
            super().__init__()
            # stack of open <p> tags; non-empty means we are inside a <p>
            self.in_p = []

        def handle_starttag(self, tag, attrs):
            if tag == 'p':
                self.in_p.append(tag)

        def handle_endtag(self, tag):
            # guard against a stray </p> with no matching <p>
            if tag == 'p' and self.in_p:
                self.in_p.pop()

        def handle_data(self, data):
            if self.in_p:
                print("<p> data :", data)

    parser = Parser()
    parser.feed(html_string)
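The parser above reports events rather than returning a tree. If a DOM-like structure is what you want, one possible approach is to collect those events into xml.etree.ElementTree elements. This is only a sketch under stated assumptions: TreeParser, parse_html and VOID_TAGS are hypothetical names introduced here, and the code assumes reasonably well-formed HTML in which every non-void tag is closed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Hypothetical helper: turns HTMLParser events into an ElementTree."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value); value is None for bare attributes.
        attrib = {name: value or "" for name, value in attrs}
        if self.stack:
            elem = ET.SubElement(self.stack[-1], tag, attrib)
        else:
            elem = ET.Element(tag, attrib)
            if self.root is None:
                self.root = elem
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Only close the innermost element; mismatched end tags are ignored.
        if self.stack and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            return  # text outside the root element is dropped
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    parser = TreeParser()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return parser.root


html_string = """
<!DOCTYPE html>
<html><head><title>title</title></head><body><p>&nbsp;test</p><div>not in p</div></body></html>
"""

root = parse_html(html_string)
print(root.findtext("head/title"))
print(repr(root.findtext("body/p")))
```

Unlike minidom, html.parser decodes entities such as &nbsp; itself, so the paragraph text arrives as a non-breaking space followed by "test". Badly nested markup will produce a lopsided tree here; lxml or html5lib handle that case far better.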

answered Sep 20 '22 20:09

Joseph Sheedy

Tags: python, html, dom, parsing

Guy
