
Is there a built-in package to parse HTML into a DOM?

I found HTMLParser for SAX-style parsing and xml.dom.minidom for XML. My HTML is fairly well formed, so I don't need a very robust parser. Any suggestions?

Guy asked May 06 '10 15:05



2 Answers

I would recommend lxml. I like BeautifulSoup, but there are maintenance issues generally and compatibility issues with the later releases. I've been happy using lxml.


Later: the best recommendations are to use lxml, html5lib, or BeautifulSoup 3.0.8. BeautifulSoup 3.1.x is meant for Python 3.x and is known to have problems with earlier Python versions, as noted on the BeautifulSoup website.

Ian Bicking has a good article on using lxml.

ElementTree is a further recommendation, but I have never used it.
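
For reference, a minimal sketch of what the ElementTree route might look like. ElementTree (xml.etree.ElementTree) ships with the standard library, but it is an XML parser, so this assumes the markup is well-formed XML. The sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; it must be well-formed XML for this to work.
html_string = (
    "<html><head><title>title</title></head>"
    "<body><p>test</p></body></html>"
)

root = ET.fromstring(html_string)

# find() accepts a limited XPath subset; ".//p" matches the first <p> at any depth.
paragraph = root.find(".//p")
print(paragraph.text)
```

Like any XML parser, this is likely to reject real-world HTML: unclosed tags or HTML-only entities such as &nbsp; raise a ParseError.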


2012-01-18: someone has come by and decided to downvote me and Bartosz because we recommended Python packages that are easily obtained but not part of the Python distribution. So for the highly literal StackOverflowers: "You can use xml.dom.minidom, but no one will recommend this over the alternatives."

hughdbrown answered Sep 20 '22 20:09


BeautifulSoup and lxml are great, but they are not appropriate answers here, since the question is about builtins. Here is an example of using the builtin minidom module to parse an HTML string. Tested with CPython 3.5.2:

    from xml.dom.minidom import parseString

    html_string = """
    <!DOCTYPE html>
    <html><head><title>title</title></head><body><p>test</p></body></html>
    """

    # extract the text value of the document's <p> tag:
    doc = parseString(html_string)
    paragraph = doc.getElementsByTagName("p")[0]
    content = paragraph.firstChild.data

    print(content)

However, as indicated in Jesse Hogan's comment, this will fail on HTML entities that minidom does not recognize. Here is an updated solution using the Python 3 html.parser module:

    from html.parser import HTMLParser

    html_string = """
    <!DOCTYPE html>
    <html><head><title>title</title></head><body><p>&nbsp;test</p><div>not in p</div></body></html>
    """

    class Parser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_p = []

        def handle_starttag(self, tag, attrs):
            if tag == 'p':
                self.in_p.append(tag)

        def handle_endtag(self, tag):
            if tag == 'p':
                self.in_p.pop()

        def handle_data(self, data):
            if self.in_p:
                print("<p> data :", data)

    parser = Parser()
    parser.feed(html_string)
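
Since the question asks for a DOM rather than SAX-style callbacks, the same html.parser module can also be used to build a small tree by hand. The following is only a rough sketch under that assumption: Node, TreeBuilder, parse_html and VOID_TAGS are hypothetical names invented for this example, not part of the standard library, and the result is a simple nested structure rather than a real W3C DOM.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal element node: tag name, attributes, children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects and text strings, in order

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)

    def text(self):
        """Concatenate all text found below this node."""
        return "".join(
            c.text() if isinstance(c, Node) else c for c in self.children
        )


class TreeBuilder(HTMLParser):
    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives already decoded.
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse_html(markup):
    """Parse a markup string and return the synthetic document root node."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


html_string = """
<!DOCTYPE html>
<html><head><title>title</title></head><body><p>&nbsp;test</p><div>not in p</div></body></html>
"""
doc = parse_html(html_string)
paragraphs = [p.text() for p in doc.find_all("p")]
print(paragraphs)
```

Unlike the minidom version, this does not fail on &nbsp;, because html.parser decodes character references before calling handle_data. It does not implement the HTML5 error-recovery rules, so badly broken markup may still produce a surprising tree.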

Joseph Sheedy answered Sep 20 '22 20:09