 

How to protect yourself from XSS when you allow people to post RAW embed codes?

Tags: regex, php

Tumblr and other blogging websites allow people to post embed codes for videos from YouTube and other video networks.

But how do they keep only the Flash object code and remove any other HTML or scripts? They even have an automated check that informs you when the code is not a valid video code.

Is this done using regular expressions? And is there a PHP class that does this?

Thanks

CodeOverload asked Mar 20 '10 02:03




2 Answers

Generally speaking, regex is not a good way to deal with HTML: HTML is not regular enough for regular expressions. The standards permit too many variations, and browsers even accept HTML that is not valid!


In PHP (since your question is tagged php), a great tool for filtering user input is HTML Purifier.

A couple of interesting things:

  • It lets you specify which tags are allowed
  • For each tag, you can define which attributes are allowed

Basically, the idea is to only keep what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).


And if you only list tags and attributes that can do no harm, only those will be kept -- and the risk of injection is greatly reduced.
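As a rough illustration of the whitelist idea, here is a minimal sketch of an HTML Purifier setup that keeps only Flash-style embed markup. It assumes HTML Purifier is installed through Composer (package ezyang/htmlpurifier) and that the directives HTML.Allowed, HTML.SafeObject, HTML.SafeEmbed and Output.FlashCompat behave as described in its configuration documentation. The helper names embedPurifierSettings() and purifyEmbedCode() are made up for this example.

```php
<?php
// Assumption: HTML Purifier was installed with Composer, so in a real
// script you would first load it with: require 'vendor/autoload.php';

// Hypothetical helper: the whitelist, expressed as HTML Purifier directives.
function embedPurifierSettings(): array
{
    return [
        // Only these tags and attributes survive; everything else is dropped.
        'HTML.Allowed'       => 'object[width|height|data|type],'
                              . 'param[name|value],'
                              . 'embed[src|type|width|height]',
        // Enable HTML Purifier's restricted handling of <object> and <embed>.
        'HTML.SafeObject'    => true,
        'HTML.SafeEmbed'     => true,
        // Compatibility tweaks for Flash objects in older browsers.
        'Output.FlashCompat' => true,
    ];
}

// Hypothetical helper: run user-submitted embed code through the whitelist.
function purifyEmbedCode(string $dirtyHtml): string
{
    $config = HTMLPurifier_Config::createDefault();
    foreach (embedPurifierSettings() as $directive => $value) {
        $config->set($directive, $value);
    }

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirtyHtml);
}
```

With this in place, calling purifyEmbedCode($_POST['embed']) should return the markup with scripts and non-whitelisted tags removed; if the result is empty, the submission probably did not contain a usable video code.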


Quoting HTMLPurifier's home page :

HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

Yes, another great thing is that the code you get as output is valid.



Of course, this only lets you clean / filter / purify the HTML input; it does not let you validate that the URL supplied by the user is both:

  • correct, i.e. it points to real content
  • "OK" as defined by your website, e.g. no nudity, ...


About the second point, there's not much one can do about it : the best solution will be to either :

  • Have a moderator accept / reject the contents before they're put online
  • Give the website's users a way to flag content as inappropriate, so a moderator can take action.

Basically, to check the actual content of the video, there is not much choice but to have a human being say "ok" or "not ok".


About the first point, though, there's hope : some services that host content have APIs that you might want / be able to use.

For instance, YouTube provides an API -- see Developer's Guide: PHP.

In your case, the "Retrieving a specific video entry" section looks promising: if you send an HTTP request to a URL that looks like this:

http://gdata.youtube.com/feeds/api/videos/videoID

(Replacing "videoID" by the ID of the video, of course)

You'll get an Atom feed if the video is valid, and "Invalid id" if it's not.

This might help you validate at least some content URLs -- even if you'll have to develop specific code for each content-hosting service that your users like...
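Here is a minimal sketch of that kind of check. The gdata endpoint quoted above belonged to YouTube's v2 API, which has since been retired, so this example assumes YouTube's oEmbed endpoint instead and treats an HTTP 200 response as "the video exists". The function names, the injectable $fetchStatus callback and the 11-character ID pattern are assumptions made for the example, not something YouTube guarantees.

```php
<?php
// Build the oEmbed lookup URL for a video ID (the endpoint choice is an assumption).
function youtubeOembedUrl(string $videoId): string
{
    $watchUrl = 'https://www.youtube.com/watch?v=' . rawurlencode($videoId);
    return 'https://www.youtube.com/oembed?format=json&url=' . rawurlencode($watchUrl);
}

// Default fetcher: returns the HTTP status code of the first response, or 0 on failure.
function fetchHttpStatus(string $url): int
{
    $headers = @get_headers($url);
    if ($headers === false || !preg_match('#^HTTP/\S+\s+(\d{3})#', $headers[0], $match)) {
        return 0;
    }
    return (int) $match[1];
}

// Returns true when the ID looks plausible and the lookup answers with HTTP 200.
// $fetchStatus can be replaced with a stub so the logic is testable offline.
function youtubeVideoExists(string $videoId, ?callable $fetchStatus = null): bool
{
    // Cheap sanity check first; assumes the usual 11-character ID format.
    if (!preg_match('/^[A-Za-z0-9_-]{11}$/', $videoId)) {
        return false;
    }

    $fetchStatus = $fetchStatus ?? 'fetchHttpStatus';
    return $fetchStatus(youtubeOembedUrl($videoId)) === 200;
}
```

Each additional hosting service would need its own lookup function along the same lines.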


Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)

The best solution to extract a portion of data from an HTML string is generally to :

  • Load the HTML using a DOM parser ; DOMDocument::loadHTML is generally pretty helpful, here
  • Go through the document using DOM methods; depending on your situation, either:
    • DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name ; might be great to iterate over all <object> or <embed> tags, for instance
    • Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
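The steps above can be sketched as follows, using only PHP's bundled DOM extension. The function name extractYouTubeIds(), the list of accepted hosts and the 11-character ID pattern are assumptions made for this example; other video services would need their own rules.

```php
<?php
// Hypothetical helper: pull YouTube video IDs out of user-submitted embed code.
function extractYouTubeIds(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // User input is rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Look at the places where an embed code normally stores the video URL.
    $xpath = new DOMXPath($doc);
    $attributes = $xpath->query(
        '//embed/@src | //iframe/@src | //object/param[@name="movie"]/@value'
    );

    // Assumed whitelist of hosts that count as YouTube.
    $allowedHosts = ['www.youtube.com', 'youtube.com', 'www.youtube-nocookie.com'];

    $ids = [];
    foreach ($attributes as $attribute) {
        $host = parse_url($attribute->value, PHP_URL_HOST);
        $path = parse_url($attribute->value, PHP_URL_PATH);
        if (!is_string($host) || !is_string($path)) {
            continue;
        }
        if (!in_array(strtolower($host), $allowedHosts, true)) {
            continue;
        }
        // Matches /v/ID (old Flash embeds) and /embed/ID (iframe embeds).
        if (preg_match('#^/(?:v|embed)/([A-Za-z0-9_-]{11})#', $path, $match)) {
            $ids[] = $match[1];
        }
    }

    return array_values(array_unique($ids));
}
```

Note that the regex here only inspects a URL path that the DOM parser has already isolated; it is never applied to the raw HTML. An empty result is a reasonable signal for telling the user that the submission is not a valid video code.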

And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
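A minimal sketch of that kind of modification, again with the bundled DOM extension: it inserts a short notice after every video element. The function name addNoticeAfterVideos(), the video-notice class and the choice of a <p> element are assumptions made for this example. Input is assumed to be ASCII or otherwise handled for encoding, since loadHTML() does not assume UTF-8 by default.

```php
<?php
// Hypothetical helper: add a text notice after each embedded video.
function addNoticeAfterVideos(string $html, string $notice): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Top-level video elements; an <embed> nested in an <object> is skipped
    // so that each video gets a single notice.
    $xpath = new DOMXPath($doc);
    $videos = $xpath->query('//iframe | //object | //embed[not(ancestor::object)]');

    // Copy to an array first so the node list is stable while we insert.
    foreach (iterator_to_array($videos) as $video) {
        $paragraph = $doc->createElement('p');
        $paragraph->setAttribute('class', 'video-notice');
        // A text node means the notice is escaped, never parsed as HTML.
        $paragraph->appendChild($doc->createTextNode($notice));
        $video->parentNode->insertBefore($paragraph, $video->nextSibling);
    }

    // Serialize only the children of <body>, not the wrapper the parser added.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $output = '';
    foreach ($body->childNodes as $child) {
        $output .= $doc->saveHTML($child);
    }
    return $output;
}
```

For example, addNoticeAfterVideos($embedCode, 'Videos are reviewed by moderators.') returns the embed code followed by the notice paragraph.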

Pascal MARTIN answered Sep 27 '22 23:09


Take a look at HTML Purifier to start: http://htmlpurifier.org/

goat answered Sep 27 '22 23:09