Let's say I have a page with a div. I can easily get that div with soup.find(). Now that I have the result, I'd like to print the WHOLE innerHTML of that div: I mean, I need a string with ALL the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?
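To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. Python's built-in html.parser works with no extra install; lxml is a popular, faster alternative, and Beautiful Soup prefers it when it is installed and you don't name a parser. You may already have lxml, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python3-lxml (python-lxml on older Python 2 systems).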
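The import check can also be done from a script instead of IDLE. This is only a sketch: the fallback to "html.parser" and the variable name parser_name are my own choices for illustration, not something Beautiful Soup requires.

```python
import importlib.util

from bs4 import BeautifulSoup

# Use lxml when it is importable; otherwise fall back to the
# parser that ships with the Python standard library.
parser_name = "lxml" if importlib.util.find_spec("lxml") else "html.parser"

soup = BeautifulSoup("<div><p>hi</p></div>", parser_name)
print(parser_name, soup.find("div"))
```

Naming the parser explicitly keeps the result the same on every machine, whichever parsers happen to be installed.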
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
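A small illustration of treating the BeautifulSoup object like a Tag. The sample markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

# The soup object is itself a Tag subclass, so the search
# methods work on it directly.
is_tag = isinstance(soup, Tag)
first_p = soup.find("p")
all_p = soup.find_all("p")

# It also renders its contents like any other tag.
whole = soup.decode_contents()

print(is_tag, soup.name, first_p, len(all_p))
print(whole)
```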
With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:
    def innerHTML(element):
        """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
        return element.encode_contents()
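Here is a quick end-to-end run of both calls, as a sketch. The sample HTML, the id "box" and the variable names are invented for the example; "html.parser" is used only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>Hello <b>world</b></p><span>!</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="box")

inner_text = div.decode_contents()   # Python str
inner_bytes = div.encode_contents()  # bytes, UTF-8 by default

print(inner_text)   # <p>Hello <b>world</b></p><span>!</span>
print(inner_bytes)
```

Note that the div's own opening and closing tags are not part of the result, which is what innerHTML gives you in the browser as well.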
These functions aren't currently in the online documentation so I'll quote the current function definitions and the doc string from the code.
encode_contents - since 4.0.4

    def encode_contents(
            self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
            formatter="minimal"):
        """Renders the contents of this tag as a bytestring.

        :param indent_level: Each line of the rendering will be
            indented this many spaces.

        :param encoding: The bytestring will be in this encoding.

        :param formatter: The output formatter responsible for converting
            entities to Unicode characters.
        """
See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for HTML entities) unless you want to manually process the text in some way.
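To make the difference between the two formatters concrete, here is a small sketch. The sample text is made up; the behaviour shown is what I'd expect from the formatter documentation.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>café &amp; crème</div>", "html.parser")
div = soup.find("div")

# "minimal" escapes only what is needed for valid markup (such as &)
# and leaves other characters as they are.
minimal_out = div.decode_contents(formatter="minimal")

# "html" additionally turns characters into named HTML entities
# where one exists.
html_out = div.decode_contents(formatter="html")

print(minimal_out)
print(html_out)
```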
encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.
decode_contents - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.
    def decode_contents(self, indent_level=None,
                        eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                        formatter="minimal"):
        """Renders the contents of this tag as a Unicode string.

        :param indent_level: Each line of the rendering will be
            indented this many spaces.

        :param eventual_encoding: The tag is destined to be
            encoded into this encoding. This method is _not_
            responsible for performing that encoding. This information
            is passed in so that it can be substituted in if the
            document contains a <META> tag that mentions the document's
            encoding.

        :param formatter: The output formatter responsible for converting
            entities to Unicode characters.
        """
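The str-versus-bytes distinction is easiest to see with non-ASCII text. A minimal sketch, with sample markup and variable names of my own:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>naïve <i>café</i></p>", "html.parser")
p = soup.find("p")

text = p.decode_contents()                       # str
utf8_bytes = p.encode_contents()                 # bytes, UTF-8 by default
latin1_bytes = p.encode_contents(encoding="latin-1")

# Both bytestrings decode back to the same text; only the
# byte representation differs.
print(type(text).__name__, type(utf8_bytes).__name__)
print(utf8_bytes.decode("utf-8") == text)
print(latin1_bytes.decode("latin-1") == text)
```

Use decode_contents when you keep working with the markup in Python, and encode_contents when you are about to write it to a file or socket.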
BeautifulSoup 3 doesn't have the above functions; instead it has renderContents:

    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string."""
This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.
Another option is to join the string form of each child node:
innerhtml = "".join([str(x) for x in div_element.contents])
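One caveat with the join approach, as far as I can tell: str() on a text node that sits directly inside the div returns the raw text, so characters such as & are not re-escaped, whereas decode_contents() escapes them. Child tags come out the same either way. The sample markup below is made up to show this.

```python
from bs4 import BeautifulSoup

html = "<div>fish &amp; chips <b>x &amp; y</b></div>"
soup = BeautifulSoup(html, "html.parser")
div_element = soup.find("div")

# Joining the children: the direct text child keeps a bare "&".
joined = "".join(str(x) for x in div_element.contents)

# decode_contents() runs every node through the output formatter.
decoded = div_element.decode_contents()

print(joined)
print(decoded)
```

If the result has to be valid HTML again, decode_contents() is the safer of the two.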