Clean up ugly WYSIWYG HTML code? Python or *nix utility

Question

I'm finally upgrading (rewriting ;) ) my first Django app, but I am migrating all the content.

I foolishly gave users a full WYSIWYG editor for certain tasks, and the HTML it produced is of course terribly ugly, with more extra tags than content.

Does anyone know of a library or external shell app I could use to clean up the code?

I use tidy sometimes, but as far as I know it doesn't do what I'm asking. I want to strip out all the extra span and other garbage tags. I cleaned up the most offensive styles with some regex, but it would take a really long time to do anything more using just regex.

Any ideas?

jaap3 · Accepted Answer

You could also take a look at Bleach, a white-list based HTML sanitizer. It uses html5lib to do what Kyle posted, but you'll get a lot more control over which elements and attributes are allowed in the final output.
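To make the white-list idea concrete, here is a minimal sketch that uses only Python's standard library. It is not Bleach's API or implementation: the names ALLOWED_TAGS, ALLOWED_ATTRS, VOID_TAGS, WhitelistCleaner and clean_html are hypothetical, chosen for illustration. It drops every tag and attribute that is not explicitly allowed and keeps the text inside dropped tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical white-list for illustration; adjust to the markup you want to keep.
ALLOWED_TAGS = {"p", "a", "b", "i", "em", "strong", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
VOID_TAGS = {"br"}  # elements that never get a closing tag


class WhitelistCleaner(HTMLParser):
    """Re-emit only allowed tags and attributes; keep the text of dropped tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content still passes through
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape the text.
        self.parts.append(escape(data, quote=False))


def clean_html(fragment):
    cleaner = WhitelistCleaner()
    cleaner.feed(fragment)
    cleaner.close()
    return "".join(cleaner.parts)


messy = (
    '<p class="MsoNormal"><span style="font-size: 12pt"><b>Hello</b></span> '
    '<font face="Arial">world</font></p>'
)
print(clean_html(messy))  # <p><b>Hello</b> world</p>
```

This only tidies markup. It does not check URL schemes or repair unbalanced tags, so for untrusted input a real sanitizer such as Bleach is still the safer choice.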

Kyle · Answer

Beautiful Soup will probably get you a more complete solution, but you might be able to get some cleanup done more simply with html5lib (if you're OK with HTML5 parsing rules):

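import html5lib
from html5lib import treebuilders, treewalkers, serializer

my_html = "<i>Some html fragment</I>"  # intentional uppercase 'I' in the closing tag

# Parse the fragment into a DOM tree; the parser normalizes the markup.
html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
dom_tree = html_parser.parseFragment(my_html)

# Walk the tree and serialize it back to a string.
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
# quote_attr_values="always" is the html5lib 1.x spelling; older releases took True.
s = serializer.HTMLSerializer(omit_optional_tags=False, quote_attr_values="always")
cleaned_html = s.render(stream)
assert cleaned_html == "<i>Some html fragment</i>"

You can also sanitize the HTML. Older html5lib releases, where sanitizer is importable directly from html5lib, did this by passing a sanitizing tokenizer to the parser:

html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"), tokenizer=sanitizer.HTMLSanitizer)

In html5lib 1.x the sanitizer moved to html5lib.filters.sanitizer and is applied to the tree walker stream instead:

from html5lib.filters import sanitizer
stream = sanitizer.Filter(walker(dom_tree))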

Tags: python, html, regex, django, wysiwyg

UserZer0
