
BeautifulSoup innerhtml?

Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the whole inner HTML of that div: I need a single string containing all the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?

Matteo Monti asked Nov 13 '11 16:11



2 Answers

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:

    def innerHTML(element):
        """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
        return element.encode_contents()
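To see the difference between the two methods, here is a minimal sketch. The sample HTML, the id "box", and the variable names are made up for illustration, and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, just for illustration.
html = '<div id="box"><p>Hello <b>world</b></p><span>café</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="box")

# Unicode string: the closest match to JavaScript's innerHTML.
inner_text = div.decode_contents()

# The same markup as a UTF-8 encoded bytestring.
inner_bytes = div.encode_contents()

print(inner_text)
print(inner_bytes)
```

Note that neither result includes the outer div tag itself; str(div) would.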

These methods weren't in the online documentation at the time of writing, so I'll quote the function definitions and docstrings from the code.

encode_contents - since 4.0.4

    def encode_contents(
        self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
        formatter="minimal"):
        """Renders the contents of this tag as a bytestring.

        :param indent_level: Each line of the rendering will be
           indented this many spaces.

        :param encoding: The bytestring will be in this encoding.

        :param formatter: The output formatter responsible for converting
           entities to Unicode characters.
        """

See also the documentation on formatters; you'll most likely use either formatter="minimal" (the default) or formatter="html" (for HTML entities), unless you want to process the text manually in some way.
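A small sketch of how the two formatters differ, assuming a made-up paragraph that contains an ampersand and accented characters (the sample text and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical sample with an ampersand and non-ASCII characters.
soup = BeautifulSoup("<p>café &amp; crème</p>", "html.parser")
p = soup.find("p")

# "minimal" escapes only what is needed for valid markup (&, <, >)
# and leaves accented characters as they are.
minimal_out = p.decode_contents(formatter="minimal")

# "html" additionally replaces characters with named HTML entities
# where such entities exist.
html_out = p.decode_contents(formatter="html")

print(minimal_out)
print(html_out)
```

The same formatter argument is accepted by encode_contents.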

encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.


decode_contents - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.

    def decode_contents(self, indent_level=None,
                        eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                        formatter="minimal"):
        """Renders the contents of this tag as a Unicode string.

        :param indent_level: Each line of the rendering will be
           indented this many spaces.

        :param eventual_encoding: The tag is destined to be
           encoded into this encoding. This method is _not_
           responsible for performing that encoding. This information
           is passed in so that it can be substituted in if the
           document contains a <META> tag that mentions the document's
           encoding.

        :param formatter: The output formatter responsible for converting
           entities to Unicode characters.
        """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above methods; instead it has renderContents:

    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

ChrisD answered Sep 28 '22 08:09


One option is to join the string form of each child node:

    innerhtml = "".join([str(x) for x in div_element.contents])
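One caveat with this approach, sketched below under the assumption of BeautifulSoup 4 on Python 3: calling str() on a text node that is a direct child returns its raw text without re-escaping entities, so the result can differ from decode_contents(). The sample HTML and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one ampersand in a direct text child,
# another inside a nested tag.
soup = BeautifulSoup(
    "<div>Fish &amp; <b>chips &amp; peas</b></div>", "html.parser"
)
div_element = soup.find("div")

# Joining str() of each child: the direct text child comes back
# unescaped, while the nested <b> tag is serialized with escaping.
joined = "".join([str(x) for x in div_element.contents])

# decode_contents() escapes text consistently at every level.
decoded = div_element.decode_contents()

print(joined)
print(decoded)
```

If the output needs to be valid markup, decode_contents() is the safer choice; for content without special characters in direct text children the two should give the same string.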
peewhy answered Sep 28 '22 07:09