My goal is to take HTML entered by an end user, remove certain unsafe tags like <script>, and add it to the document. Does anybody know of a good JavaScript library to sanitize HTML?
I searched around and found a few online, including John Resig's HTML parser, Erik Arvidsson's simple html parser, and Google's Caja Sanitizer. However, I haven't been able to find much information about whether people have had good experiences with these libraries, and I'm worried that they aren't robust enough to handle arbitrary HTML. Would I be better off just sending the HTML to my Java server for sanitization?
Contrary to what we found for Java and C#, there is no definitive choice: there are many good options for parsing JavaScript. The three most popular libraries seem to be Acorn, Esprima, and UglifyJS. We are not going to say which one is best, because they all seem to be excellent, up to date, and well supported.
The goal of this article is to help you find the right library to process HTML. We consider Java, C#, Python, and JavaScript libraries.
There is also an extension, based on Jint, that integrates scripting into the parsing of HTML documents for both C# and JavaScript. This means that you can parse HTML documents after they have been modified by JavaScript, whether that JavaScript is included in the page or is a script you add yourself.
Below, we’ve rounded up the most popular JavaScript libraries available today. jQuery is a classic JavaScript library that’s fast, lightweight, and feature-rich. John Resig released it in 2006 at BarCamp NYC. jQuery is free and open-source software released under the MIT License.
You can parse HTML with jQuery, but I'm pretty sure any blacklist-based (i.e. "filtering out") approach to sanitizing is going to fail. You probably need a "filtering in" (whitelist) approach, and ultimately you don't want to rely on JavaScript for security anyway. In any case, for reference, you can use jQuery for DOM parsing like this:
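var htmlS = "<html>etc.etc.";
// Parse into a detached wrapper; $.parseHTML drops <script> elements by default.
var $content = $("<div>").append($.parseHTML(htmlS));
$content.find("script").remove(); /* .remove("script") only filters the top-level set. DON'T RELY ON THIS FOR SECURITY */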
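To make the "filtering in" idea concrete, here is a minimal sketch of a whitelist-based pass. It is not taken from any of the libraries mentioned above. It assumes the HTML has already been parsed by a real parser into a tree of plain objects; that node shape, the tag list, and the names `ALLOWED_TAGS`, `SAFE_URL`, and `sanitizeNode` are all made up for illustration. Anything not explicitly allowed is dropped, and all text and attribute values are escaped on output.

```javascript
// Hypothetical node shape (an assumption, not any library's API):
//   text node:    { text: "..." }
//   element node: { tag: "a", attrs: { href: "..." }, children: [...] }

// Allowlist: tag name -> attribute names permitted on that tag.
const ALLOWED_TAGS = new Map([
  ["p", []], ["b", []], ["i", []], ["em", []], ["strong", []],
  ["ul", []], ["ol", []], ["li", []], ["br", []],
  ["a", ["href"]],
]);

// Elements that never have children or a closing tag.
const VOID_TAGS = new Set(["br"]);

// Only absolute http(s) and mailto links are accepted in href.
const SAFE_URL = /^(?:https?:\/\/|mailto:)/i;

function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Serialize a node, keeping only allowlisted tags and attributes.
function sanitizeNode(node) {
  if (node === null || typeof node !== "object") return "";
  if (typeof node.text === "string") return escapeHtml(node.text);

  const tag = String(node.tag || "").toLowerCase();
  // Unknown elements are dropped together with their content.
  if (!ALLOWED_TAGS.has(tag)) return "";

  let attrs = "";
  const source = node.attrs || {};
  for (const name of ALLOWED_TAGS.get(tag)) {
    if (!Object.prototype.hasOwnProperty.call(source, name)) continue;
    const value = String(source[name]);
    // Reject javascript:, data:, relative URLs, and anything else.
    if (name === "href" && !SAFE_URL.test(value)) continue;
    attrs += ` ${name}="${escapeHtml(value)}"`;
  }

  if (VOID_TAGS.has(tag)) return `<${tag}${attrs}>`;
  const inner = (node.children || []).map(sanitizeNode).join("");
  return `<${tag}${attrs}>${inner}</${tag}>`;
}

// Usage: the onclick attribute and the <script> element are not on the
// allowlist, so neither appears in the output.
const exampleTree = {
  tag: "p",
  attrs: { onclick: "evil()" },
  children: [
    { text: "Hello " },
    { tag: "script", children: [{ text: "alert(1)" }] },
    { tag: "a", attrs: { href: "https://example.com/" }, children: [{ text: "link" }] },
  ],
};
console.log(sanitizeNode(exampleTree));
```

Dropping unknown elements along with their content is the conservative choice here; a more forgiving variant could keep the escaped text of unknown elements. Either way, the same filtering would still have to be repeated on the server.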
Would I be better off just sending the HTML to my Java server for sanitization?
Yes.
Filtering "unsafe" input must be done server-side. There is no other way to do it. It's not possible to do filtering client-side because the "client-side" could be a web browser or it could just as easily be a bot with a script.