Is there a good JavaScript-based HTML parsing library available?

My goal is to take HTML entered by an end user, remove certain unsafe tags such as <script>, and add the result to the document. Does anybody know of a good JavaScript library for sanitizing HTML?

I searched around and found a few online, including John Resig's HTML parser, Erik Arvidsson's simple html parser, and Google's Caja Sanitizer, but I haven't been able to find much information about whether people have had good experiences using these libraries, and I'm worried that they aren't really robust enough to handle arbitrary HTML. Would I be better off just sending the HTML to my Java server for sanitization?

nas asked Jul 04 '10 23:07



2 Answers

You can parse HTML with jQuery, but I'm pretty sure any blacklist-based (i.e. "filtering out") approach to sanitizing is going to fail. You probably need a whitelist-based ("filtering in") approach, and ultimately you don't want to rely on JavaScript for security anyway. In any case, for reference, you can use jQuery for DOM parsing like this:

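var htmlS = "<html>etc.etc.";
// $.parseHTML (jQuery 1.8+) builds detached nodes and drops <script> unless keepScripts is true
var $wrapper = $("<div>").append($.parseHTML(htmlS));
$wrapper.find("script").remove(); /* DON'T RELY ON THIS FOR SECURITY */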
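For comparison, the "filtering in" idea can be sketched without any DOM at all: escape every HTML-significant character first, then restore only a short list of attribute-free tags. This is a minimal illustration, not a vetted sanitizer. The names ALLOWED_TAGS, escapeHtml and sanitizeAllowList are hypothetical, and the tag list is an assumption you would adjust. The same logic would still need to run on the server to be trusted.

```javascript
// Hypothetical allow-list (an assumption for this sketch); only bare tags
// without attributes are ever restored.
const ALLOWED_TAGS = ["b", "i", "em", "strong", "p", "br", "ul", "ol", "li"];

// Escape every character that has special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// "Filtering in": everything is escaped, then only exact matches of
// <tag> or </tag> for allow-listed names are turned back into markup.
function sanitizeAllowList(input) {
  const escaped = escapeHtml(input);
  const tagPattern = new RegExp(
    "&lt;(/?)(" + ALLOWED_TAGS.join("|") + ")&gt;",
    "gi"
  );
  return escaped.replace(
    tagPattern,
    (match, slash, tag) => "<" + slash + tag.toLowerCase() + ">"
  );
}
```

Anything not on the list, including allowed tags that carry attributes such as onclick, stays escaped and is displayed as literal text. The sketch does not balance unclosed tags, so it can still produce untidy markup.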
Matt Mitchell answered Oct 13 '22 04:10


Would I be better off just sending the HTML to my Java server for sanitization?

Yes.

Filtering "unsafe" input must be done server-side. There is no other way to do it. Client-side filtering cannot be relied on, because the "client" could be a web browser or it could just as easily be a bot with a script that never runs your JavaScript.

thomasrutter answered Oct 13 '22 04:10