I've set up a page where people can submit tutorials. The tutorials are written in a TinyMCE editor.
Someone could abuse this by POSTing their own unescaped text directly and inserting a malicious <script>.
So my question is: would it be safe enough to remove <script> tags with a regular expression? I would run this regex on my backend, before storing the content.
I've found this expression, for example:
<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>
No. An attacker could use multi-byte characters to slip past your regexp, or sneakily combine mismatched opening and closing tags, create fake closing script tags, quote them inside attributes, and so on. Don't attempt to parse potentially noisy or malformed HTML with a regex; use an HTML parsing engine designed to deal with such concerns. See the famous answer on parsing HTML with regex here: RegEx match open tags except XHTML self-contained tags
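As a small illustration of the mismatched-tag problem, here is a sketch that runs the pattern from the question (with the "gi" flags added) against a closing tag that has a space before its ">". Browsers still treat that as the end of the script element, but the pattern only looks for a literal "</script>". The names SCRIPT_RE, payload and filtered are made up for this example.

```javascript
// The pattern from the question, with the "gi" flags added.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// HTML parsers allow whitespace before the ">" of an end tag,
// so a browser would still run this as a script.
const payload = "<script>alert('XSS!')</script >";

// The pattern never finds a literal "</script>", so nothing is removed.
const filtered = payload.replace(SCRIPT_RE, "");

console.log(filtered === payload);
```

This is only one variation; a real HTML parser handles it because it tokenizes tags the same way a browser does.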
If you're looking for one, I swear by this PHP library: http://simplehtmldom.sourceforge.net/
It first cleans the document by converting noise to entities, before taking into account "script", "style", and "textarea" elements, in which anything found between the opening and closing tags is meant to be text, not HTML. It then parses the result into a DOM structure that you can traverse in much the same way as a document with the DOM methods in JavaScript. It also comes with a "save" method (which returns the string), so once you're done stripping tags from the page, you have your modified, well-formed document. I have also tested the library with large data: a regexp I was using before failed on large documents because it hit PHP memory limits, whereas this library parsed the same documents without memory issues. I've tested it quite thoroughly and used it on large projects, and it has never let me down, unlike built-in PHP functions and classes when given malformed data.
Edit: Here's an example of how to break it:
<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>
The fact that jQuery uses this regex doesn't make it safe on the server.
Even if you used the "gi" flags, it doesn't matter:
var str="<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
str=str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,'');
// The "g" flag doesn't help here: after removing a match, the search continues
// from the middle of the string instead of starting over from the beginning.
alert(str); // shows <script>alert('XSS!')</script>
If you ran the replacement in a loop until the string stops changing, rather than relying on the "g" flag alone, you would get rid of this particular case.
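A minimal sketch of that loop, assuming the same pattern as above; stripScriptsRepeatedly is a hypothetical helper name, not part of any library. It re-applies the replacement until a pass changes nothing, so a tag that is reassembled by one removal gets caught by the next pass.

```javascript
// The pattern from the question, with the "gi" flags added.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: keep stripping until the string stops changing.
function stripScriptsRepeatedly(html) {
  let previous;
  let current = html;
  do {
    previous = current;
    current = previous.replace(SCRIPT_RE, "");
  } while (current !== previous);
  return current;
}

const nested = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";

// A single pass leaves a reassembled script element behind.
const singlePass = nested.replace(SCRIPT_RE, "");

// Repeated passes remove that reassembled element as well.
const looped = stripScriptsRepeatedly(nested);

console.log(JSON.stringify(singlePass), JSON.stringify(looped));
```

This only closes the nested-tag hole shown above. It does nothing about the other bypasses, so it is still not a substitute for a real parser or sanitizer.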
Edit 2: If the purpose is to sanitize user input against all JavaScript concerns, including attributes like "onload" and "onclick", why reinvent the wheel? There's http://htmlpurifier.org/ (see the demo)
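To show why stripping script tags alone is not enough, here is a sketch in which the payload contains no script element at all, only an event-handler attribute. The "onerror" attribute and the variable names are chosen for illustration; "onload" and "onclick" behave the same way as far as the regex is concerned.

```javascript
// The pattern from the question, with the "gi" flags added.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// No <script> element here: the JavaScript lives in an event-handler attribute.
const payload = '<img src="x" onerror="alert(\'XSS!\')">';

// There is nothing for the pattern to match, so the markup passes through untouched.
const filtered = payload.replace(SCRIPT_RE, "");

console.log(filtered === payload);
```

A whitelist-based sanitizer such as HTML Purifier approaches this from the other side: it keeps only the elements and attributes it knows to be safe instead of trying to enumerate the dangerous ones.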