
BeautifulSoup replaceWith() method adding escaped html, want it unescaped

I have a Python method (thanks to this snippet) that takes some HTML and wraps <a> tags around ONLY unformatted links, using BeautifulSoup and Django's urlize:

from django.utils.html import urlize
from bs4 import BeautifulSoup

def html_urlize(self, text):
    soup = BeautifulSoup(text, "html.parser")

    print(soup)

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        if textNode.parent and getattr(textNode.parent, 'name') == 'a':
            continue  # skip already formatted links
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)

    print(soup)

    return str(soup)

Sample input text (as output by the first print statement) is this:

this is a formatted link <a href="http://google.ca">http://google.ca</a>, this one is unformatted and should become formatted: http://google.ca

The resulting return text (as output by the second print statement) is this:

this is a formatted link <a href="http://google.ca">http://google.ca</a>, this one is unformatted and should become formatted: &lt;a href="http://google.ca"&gt;http://google.ca&lt;/a&gt;

As you can see, the link is being formatted, but with escaped HTML, so when I print it in a template with {{ my.html|safe }} it doesn't render as HTML.

So how can I get the tags added by urlize to be unescaped, so that they render properly as HTML? I suspect this has something to do with me using it as a method instead of a template filter. I can't actually find the docs on this method; it doesn't appear in django.utils.html.

Edit: It appears the escaping actually happens in this line: textNode.replaceWith(urlizedText).
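A minimal reproduction without Django seems to confirm this. The sketch below assumes only that bs4 is installed; the sample markup and variable names are made up for illustration, and it uses replace_with, the newer spelling of replaceWith. A plain string handed to replace_with() is inserted as a text node, so its angle brackets are escaped when the tree is serialized:

```python
from bs4 import BeautifulSoup

# Illustrative input: one paragraph containing a bare URL.
soup = BeautifulSoup("<p>see http://google.ca</p>", "html.parser")
text_node = soup.p.string

# A plain str is inserted as text, not parsed as markup,
# so the "<" and ">" in it are escaped on output.
text_node.replace_with('see <a href="http://google.ca">http://google.ca</a>')

escaped_output = str(soup)
print(escaped_output)
```

The tree contains no <a> tag afterwards, only text that looks like one.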

asked Oct 04 '15 18:10 by 43Tesseracts




1 Answer

You can turn your urlizedText string into a new BeautifulSoup object, and it will be treated as a tag in its own right rather than as text within one (which is escaped, as you'd expect).

from django.utils.html import urlize
from bs4 import BeautifulSoup

def html_urlize(self, text):
    soup = BeautifulSoup(text, "html.parser")

    print(soup)

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        if textNode.parent and getattr(textNode.parent, 'name') == 'a':
            continue  # skip already formatted links
        urlizedText = urlize(textNode)
        # Parse the urlized string so the <a> is inserted as a tag, not as escaped text
        textNode.replaceWith(BeautifulSoup(urlizedText, "html.parser"))

    print(soup)

    return str(soup)
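
To try the same idea without a Django project, here is a self-contained sketch. It rests on a few assumptions: simple_urlize and URL_RE are hypothetical stand-ins for Django's urlize (a deliberately naive regex, not Django's behaviour), the function is written as a plain function rather than a method, and it uses the newer find_all / replace_with spellings.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, deliberately naive URL pattern; Django's urlize is more thorough.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def simple_urlize(text):
    """Stand-in for django.utils.html.urlize: wrap bare URLs in <a> tags."""
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )


def html_urlize(text, linkify=simple_urlize):
    soup = BeautifulSoup(text, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for text_node in soup.find_all(string=True):
        if text_node.parent is not None and text_node.parent.name == "a":
            continue  # skip links that are already formatted
        # Parse the linkified string so it is inserted as tags, not text.
        fragment = BeautifulSoup(linkify(str(text_node)), "html.parser")
        text_node.replace_with(fragment)
    return str(soup)


sample = (
    'this is a formatted link <a href="http://google.ca">http://google.ca</a>, '
    "this one is unformatted: http://google.ca"
)
result = html_urlize(sample)
print(result)
```

One caveat: text nodes come back from BeautifulSoup unescaped, so a literal &lt;b&gt; in the input would be re-parsed as a real tag here. With Django's urlize, passing autoescape=True should escape the text before linkifying, which is probably what you want for untrusted input.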

answered Oct 01 '22 06:10 by Oli