I am working on a PHP site that allows users to post a listing for their business related to the site's theme. This includes a single link URL, some text, and an optional URL for an image file.
Example:
<img src="http://www.somesite.com" width="40" />
<a href="http://www.abcbusiness.com" target="new">ABC Business</a>
<p>
Some text about how great abc business is...
</p>
The HTML in the text is filtered using the HTML Purifier class from htmlpurifier.org, and the content is checked for bad words, so I feel pretty good about that part.
The image file URL is always placed inside an <img src="" /> tag with a fixed width and is validated to be an actual HTTP URL, so that should be OK.
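For reference, a minimal sketch of that kind of check (the image_url field name is only a placeholder):

<?php
// Sketch: accept only well-formed http/https URLs for the image field.
// $imageUrl is assumed to hold the user-submitted value.
$imageUrl = isset($_POST['image_url']) ? trim($_POST['image_url']) : '';

$isHttpUrl = filter_var($imageUrl, FILTER_VALIDATE_URL) !== false
    && in_array(parse_url($imageUrl, PHP_URL_SCHEME), array('http', 'https'), true);

if (!$isHttpUrl) {
    // Reject the listing or drop the image field.
}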
The dangerous part is the link.
Question: How can I be sure that the link does not point to some spam, unsafe, or porn site (using code)?
I can check the response headers for 404s, etc., but is there a quick and easy way to validate a site's content from a link?
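By checking headers I mean something roughly like this (a sketch using cURL; it only confirms that the link resolves and says nothing about the content):

<?php
// Sketch: HEAD request via cURL; only confirms the link resolves,
// it says nothing about what the page contains.
function linkResolves($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,   // HEAD request, skip the body
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ));
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return $status >= 200 && $status < 400;
}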
EDIT:
I am using a CAPTCHA and do require registration before posting is allowed.
It's going to be very hard to determine this yourself by scraping the URLs in question. You'll probably want to rely on a third-party API that can check for you.
http://code.google.com/apis/safebrowsing/
Check out that API: you can send it a URL and it will tell you what it thinks. It mainly checks for malware and phishing, not so much porn and spam. There are others that do the same thing; just search around on Google.
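As a rough sketch of what such a lookup can look like, here is a call against the current v4 Lookup endpoint of that API (safebrowsing.googleapis.com); you need your own API key, and the client ID below is a placeholder:

<?php
// Sketch: query the Google Safe Browsing Lookup API (v4) for a single URL.
// An empty "matches" field in the response means no known threat was found.
function urlFlaggedBySafeBrowsing($url, $apiKey)
{
    $endpoint = 'https://safebrowsing.googleapis.com/v4/threatMatches:find?key=' . urlencode($apiKey);

    $body = json_encode(array(
        'client' => array('clientId' => 'my-listing-site', 'clientVersion' => '1.0'),
        'threatInfo' => array(
            'threatTypes'      => array('MALWARE', 'SOCIAL_ENGINEERING', 'UNWANTED_SOFTWARE'),
            'platformTypes'    => array('ANY_PLATFORM'),
            'threatEntryTypes' => array('URL'),
            'threatEntries'    => array(array('url' => $url)),
        ),
    ));

    $ch = curl_init($endpoint);
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ));
    $response = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($response, true);
    return !empty($data['matches']); // true if Google reports a match for the URL
}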
"is there a quick and easy way to validate a site's content from a link?"
No. There is no global whitelist or blacklist of URLs that you can use to filter out "bad" sites, especially since your definition of a "bad" site is so vague.
Even if you could look at a URL and tell whether the page it points to has bad content, it's trivially easy to disguise a URL these days.
If you really need to prevent this, you should moderate your content. Any automated solution is going to be imperfect, and you're going to wind up moderating manually anyway.
Manual moderation, perhaps. I can't think of any way to automate this other than using some sort of blacklist, but even that is not always reliable, as newer sites might not be on the list.
Additionally, you could use cURL to download the index page, look for certain keywords that would raise a red flag, and then hold those listings for manual validation.
I would suggest keeping a list of these keywords in an array (porn, sex, etc.). If the index page you downloaded with cURL contains any of those keywords, reject the listing or flag it for moderation.
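A minimal sketch of that approach (the keyword list, the function name, and $submittedUrl are placeholders):

<?php
// Sketch: download the linked page with cURL and scan it for red-flag keywords.
// The keyword list is only a placeholder; tune it for your site.
function pageNeedsReview($url, array $keywords)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
    ));
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        return true; // could not fetch the page: flag for manual review
    }

    $text = strtolower(strip_tags($html));
    foreach ($keywords as $keyword) {
        if (strpos($text, strtolower($keyword)) !== false) {
            return true; // keyword hit: flag for moderation
        }
    }
    return false;
}

// Example: hold the listing for manual validation if anything matches.
$flagged = pageNeedsReview($submittedUrl, array('porn', 'sex', 'viagra'));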
This is neither reliable nor the most efficient way of approving links.
Ultimately, you should have manual moderation regardless, but if you wish to automate it, this is a possible route for you to take.