Tumblr and other blogging websites allow people to post embed codes for videos from YouTube and other video networks. But how do they keep only the Flash object code and remove any other HTML or scripts? They even have an automated check that informs you when something is not a valid video code. Is this done using regular expressions? And is there a PHP class to do that? Thanks

How to protect yourself from XSS when you allow people to post RAW embed codes?

2 Answers

Generally speaking, regex is not a good way to deal with HTML: HTML is not regular enough for regular expressions, because the standards permit too many variations. And browsers even accept HTML that's not valid!

Since your question is tagged php, a great solution for filtering user input is the HTMLPurifier tool.

A couple of interesting things about it:

- It allows you to specify which tags are allowed
- For each tag, you can define which attributes are allowed

Basically, the idea is to keep only what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).

And if you only specify tags and attributes that can do no harm, only those will be kept -- and the risk of injection is lowered a lot.
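
To make the white-list idea concrete, here is a minimal sketch that uses only PHP's built-in DOM extension. It is not how HTMLPurifier works internally, and it is not a replacement for it: the `ALLOWED_TAGS` table and the `keepWhitelisted` function are hypothetical names made up for this illustration, and the sketch does not inspect attribute values (a `src` could still hold an unwanted URL), which is one reason a vetted library is the better choice.

```php
<?php
// Hypothetical white-list for this sketch: tag name => allowed attribute names.
const ALLOWED_TAGS = [
    'object' => ['width', 'height'],
    'param'  => ['name', 'value'],
    'embed'  => ['src', 'type', 'width', 'height'],
];

function keepWhitelisted(string $html): string
{
    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so it can be found again after parsing;
    // @ silences the warnings libxml emits for invalid markup.
    @$doc->loadHTML('<div>' . $html . '</div>');
    $root = $doc->getElementsByTagName('div')->item(0);

    // Copy the node list into an array before modifying the tree.
    $elements = iterator_to_array((new DOMXPath($doc))->query('.//*', $root), false);
    foreach ($elements as $el) {
        $tag = strtolower($el->nodeName);
        if (!array_key_exists($tag, ALLOWED_TAGS)) {
            // Tag is not on the list: drop it together with everything inside it.
            if ($el->parentNode !== null) {
                $el->parentNode->removeChild($el);
            }
            continue;
        }
        // Tag is allowed: strip every attribute that is not listed for it.
        foreach (iterator_to_array($el->attributes, false) as $attr) {
            if (!in_array(strtolower($attr->nodeName), ALLOWED_TAGS[$tag], true)) {
                $el->removeAttributeNode($attr);
            }
        }
    }

    // Serialize only what is left inside the wrapper <div>.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Anything that is not explicitly listed disappears, so a new kind of attack has to get through the short list of allowed tags and attributes rather than past an ever-growing list of forbidden ones.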

Quoting HTMLPurifier's home page:

HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

Another great thing is that the code you get as output is valid.

Of course, this will only allow you to clean / filter / purify the HTML input; it will not allow you to validate that the URL used by the user is both:

- correct, i.e. it points to real content
- "OK" as defined by your website, e.g. no nudity

About the second point, there's not much one can do automatically. The best solution is to either:

- Have a moderator accept / reject the content before it is put online
- Give the website's users a way to flag content as inappropriate, so a moderator can take action

Basically, to check the content of the video itself, there is not much choice but to have a human being say "ok" or "not ok".

About the first point, though, there's hope: some services that host content have APIs that you might want / be able to use.

For instance, YouTube provides an API -- see Developer's Guide: PHP.

In your case, the Retrieving a specific video entry section looks promising: if you send an HTTP request to a URL that looks like this:

http://gdata.youtube.com/feeds/api/videos/videoID

(replacing "videoID" with the ID of the video, of course)

you'll get an Atom feed if the video is valid, and "Invalid id" if it's not.

This might help you validate at least some URLs to content -- even if you'll have to develop specific code for each content-hosting service that your users like...
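
A rough sketch of that check, assuming the endpoint behaves exactly as described above (an XML feed for a known video, the plain text "Invalid id" otherwise); whether the URL is still served is not something this sketch verifies. The function names `videoEntryUrl`, `looksLikeValidVideo` and `isExistingVideo` are made up for the example.

```php
<?php
// Endpoint quoted in the answer; its availability is an assumption of this sketch.
const VIDEO_ENTRY_URL = 'http://gdata.youtube.com/feeds/api/videos/';

// Build the request URL for one video ID.
function videoEntryUrl(string $videoId): string
{
    return VIDEO_ENTRY_URL . rawurlencode($videoId);
}

// Decide from a response body whether the video exists:
// an XML document rooted at <entry> or <feed> counts as valid,
// anything else (such as the plain text "Invalid id") does not.
function looksLikeValidVideo(string $body): bool
{
    if (trim($body) === '') {
        return false;
    }
    $doc = new DOMDocument();
    // @ silences the parser warnings raised for non-XML bodies.
    if (!@$doc->loadXML($body)) {
        return false;
    }
    return in_array($doc->documentElement->localName, ['entry', 'feed'], true);
}

// Fetch the entry and interpret it; requires allow_url_fopen to be enabled.
function isExistingVideo(string $videoId): bool
{
    $body = @file_get_contents(videoEntryUrl($videoId));
    return $body !== false && looksLikeValidVideo($body);
}
```

Keeping the interpretation of the response separate from the HTTP call makes it easier to add a similar pair of functions for each other hosting service you decide to support.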

Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)

The best solution to extract a portion of data from an HTML string is generally to:

- Load the HTML using a DOM parser; DOMDocument::loadHTML is generally pretty helpful here
- Go through the document using DOM methods; depending on your situation, either:
  - DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name; it might be great for iterating over all <object> or <embed> tags, for instance
  - Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
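
The two steps could look roughly like this. The sketch assumes old-style embed codes whose URLs have the shape `http://www.youtube.com/v/VIDEO_ID`; that shape and the function name `extractVideoIds` are assumptions made for the example. Note that the small regex is applied to a single URL that the DOM parser has already isolated, not to the HTML itself.

```php
<?php
// Return the distinct video IDs referenced by <embed src> and
// <param name="movie" value> attributes in an HTML fragment.
function extractVideoIds(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // @ silences the warnings libxml emits for invalid markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $ids = [];
    foreach ($xpath->query('//embed/@src | //param[@name="movie"]/@value') as $attr) {
        // Assumed URL shape for this sketch: http(s)://www.youtube.com/v/VIDEO_ID...
        if (preg_match('~^https?://www\.youtube\.com/v/([A-Za-z0-9_-]+)~', $attr->value, $m)
            && !in_array($m[1], $ids, true)) {
            $ids[] = $m[1];
        }
    }
    return $ids;
}
```

An empty result is a reasonable signal for telling the user that what they pasted is not a valid video code.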

And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
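
As an illustration of that last point, the following sketch appends a short note after every <object> element. The function name `addNoteAfterVideos` and the wrapper <div> used to get the fragment back out are choices made for this example only.

```php
<?php
// Insert a <p> containing $note right after each <object> in the fragment.
function addNoteAfterVideos(string $html, string $note): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so it can be serialized again without <html>/<body>;
    // @ silences the warnings libxml emits for invalid markup.
    @$doc->loadHTML('<div>' . $html . '</div>');
    $root = $doc->getElementsByTagName('div')->item(0);

    // Copy the live node list into an array before changing the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('object'), false) as $object) {
        $p = $doc->createElement('p');
        // A text node is escaped on output, so the note cannot inject markup.
        $p->appendChild($doc->createTextNode($note));
        // insertBefore() with a null reference node appends at the end.
        $object->parentNode->insertBefore($p, $object->nextSibling);
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```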

answered Sep 27 '22 23:09

Pascal MARTIN

Take a look at htmlpurifier to start. http://htmlpurifier.org/

answered Sep 27 '22 23:09

goat

Tags:

regex

php

CodeOverload