
PHP - remove http/www from message (except for the host domain) to disable clickable links

I have a simple message board, say mywebsite.com, that lets users post messages. Currently the board makes all links clickable, i.e. when someone posts something that starts with:

http://, https://, www., http://www., https://www.

then the script automatically turns it into a link (i.e. wraps it in an <a href="..."> tag).

THE PROBLEM: there is too much spam. My idea is to automatically remove the http(s):// and www. prefixes listed above so that these no longer become clickable links. HOWEVER, I want to allow posters to link to pages within my site, i.e. the prefix should not be removed when the link points to mywebsite.com.

My idea was to create two arrays:

$removeParts = array('http://', 'https://', 'www.', 'http://www.', 'https://www.');

$keepParts = array('http://mywebsite.com', 'http://www.mywebsite.com', 'www.mywebsite.com', 'https://www.mywebsite.com', 'https://mywebsite.com');

but I don't know how to use them correctly (probably str_replace could work somehow).

Below is an example of $message before and after posting:

$message BEFORE:

Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.

$message AFTER:

Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on bing.com, google.com/search and on some spamwebsite.com/refid=spammer2.


Please note that users enter plain text into the post form, so the script only needs to handle plain text (not <a href> tags etc.).

NonCoder asked Apr 24 '15 23:04


2 Answers

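$url = "http://mywebsite.com/about";
// Returns the host as a string, null if the URL has no host, or false if it is seriously malformed.
$host = parse_url($url, PHP_URL_HOST);

// Compare the host only, so a path or query string cannot fake a match.
if ($host === "mywebsite.com") {
    echo "My site, let's mark it as link";
}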

More info: http://php.net/manual/en/function.parse-url.php
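The host check above handles one URL at a time. As a rough sketch of how it could be applied to every link in a plain-text message, the snippet below finds each link with preg_replace_callback() and strips the prefix only when the host is not on an allow-list. The function name stripForeignLinkPrefixes, the $allowedHosts parameter and the link pattern are assumptions made for illustration, not something taken from the question or this answer.

```php
<?php
// Hypothetical helper: removes "http://", "https://" and "www." from links whose
// host is not in $allowedHosts, so the board's auto-linker no longer picks them up.
function stripForeignLinkPrefixes(string $message, array $allowedHosts): string
{
    // A link starts with http://, https:// or www. and runs to the next whitespace;
    // the last character must not be sentence punctuation such as "," or ".".
    $pattern = '~\b(?:https?://|www\.)[^\s<>"\']*[^\s<>"\'.,;:!?)]~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($allowedHosts) {
        $link = $m[0];
        // parse_url() only reports a host when a scheme is present, so add one for bare "www." links.
        $withScheme = preg_match('~^https?://~i', $link) ? $link : 'http://' . $link;
        $host = parse_url($withScheme, PHP_URL_HOST);
        $host = is_string($host) ? strtolower($host) : '';
        // Treat "www.example.com" and "example.com" as the same host.
        $bareHost = (string) preg_replace('~^www\.~', '', $host);

        if (in_array($bareHost, $allowedHosts, true)) {
            return $link; // own site: leave the link untouched
        }
        // Foreign site: strip every leading scheme/"www." so nothing linkable remains.
        return (string) preg_replace('~^(?:https?://|www\.)+~i', '', $link);
    }, $message);

    return $result ?? $message; // fall back to the input if the regex engine reports an error
}

$message = 'Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on '
         . 'http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.';

echo stripForeignLinkPrefixes($message, ['mywebsite.com']);
```

For the sample message this prints the text with mywebsite.com still linked and the other three links reduced to bing.com, google.com/search and spamwebsite.com/refid=spammer2. Comparing the parsed host rather than searching the whole string means a link such as http://mywebsite.com@evil.example/ is not mistaken for an internal one.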

Ido answered Oct 26 '22 05:10


killSpam() function features:

  • Works with single and double quotes
  • Handles invalid HTML
  • ftp://
  • http://
  • https://
  • file://
  • mailto:

function killSpam($html, $whitelist){

    // Process HTML links: $match[1] is the whole <a> element, $match[2] its href value.
    preg_match_all('%(<\s*a\b.*?href=["\'](.*?)["\'].*?>(.*?)<\s*/\s*a\s*>)%s', $html, $match, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($match[1]); $i++) {
        // Test the href only, so whitelisted text inside the link cannot hide a spam target.
        if (!preg_match("/$whitelist/", $match[2][$i])) {
            // Literal replacement of the whole element; no regex escaping needed.
            $html = str_replace($match[1][$i], " (SPAM) ", $html);
        }
    }

    // Process cleartext links: URLs, www./ftp. hosts, e-mail addresses and quoted URLs.
    preg_match_all('/(\b(?:(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[A-Z0-9+&@#\/%?=~_|$!:,.;-]*[A-Z0-9+&@#\/%=~_|$-]|((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,6})\b)|"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^"\r\n]+"|\'(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^\'\r\n]+\')/i', $html, $match2, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($match2[1]); $i++) {
        if (!preg_match("/$whitelist/", $match2[1][$i])) {
            $html = str_replace($match2[1][$i], " (SPAM) ", $html);
        }
    }

    return $html;
}

Usage:

$html = <<< LOB
 <p>Hello world, thanks to <a href="http://mywebsite.com/about" rel="nofollow">http://mywebsite/about</a> I learned a lot. I found
  you on <a href="http://www.bing.com" rel="nofollow">http://www.bing.com</a>, <a href="https://google.com/search" rel="nofollow">https://google.com/search</a> and on some <a href="http://www.spamwebsite.com" rel="nofollow">www.spamwebsite.com/refid=spammer2< /a >. www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and [email protected], file://spamfile.com/file.txt ftp://spamftp.com/file.exe </p>
LOB;

$whitelist = "(google\.com|yahoo\.com|bing\.com|nicesite\.com|mywebsite\.com)";

$noSpam = killSpam($html, $whitelist);

echo $noSpam;

Spam Example:

I cannot post the spam HTML here (I guess SO has its own killSpam()), so view it at http://pastebin.com/HXCkFeGn

Hello world, thanks to http://mywebsite/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2. www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and [email protected], file://spamfile.com/file.txt ftp://spamftp.com/file.exe


Output:

Hello world, thanks to (SPAM) I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some (SPAM) . (SPAM) , (SPAM) , (SPAM) and (SPAM) , (SPAM) (SPAM)


Demo:

http://ideone.com/9IxFrB
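
One caveat worth illustrating: the $whitelist pattern in the usage above is matched anywhere in the link, so a spam URL that merely mentions a whitelisted domain in its path or query string would pass. The sketch below shows the effect and one possible stricter check that tests the parsed host instead. The hostIsWhitelisted function and the spam.example domain are hypothetical and not part of the answer's code.

```php
<?php
$whitelist = "(google\.com|yahoo\.com|bing\.com|nicesite\.com|mywebsite\.com)";

// Unanchored match: the whitelisted name appears only in the query string, yet it matches.
var_dump(preg_match("/$whitelist/", 'http://spam.example/?ref=mywebsite.com')); // int(1)

// Hypothetical stricter check: only the host is tested, and the pattern is anchored
// so that just the domain itself or one of its subdomains passes.
// Links without a scheme (e.g. "www.bing.com") have no parsed host and are rejected,
// so a scheme would have to be prepended before calling this.
function hostIsWhitelisted(string $url, string $whitelist): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) && preg_match('/(^|\.)' . $whitelist . '$/i', $host) === 1;
}

var_dump(hostIsWhitelisted('http://spam.example/?ref=mywebsite.com', $whitelist)); // bool(false)
var_dump(hostIsWhitelisted('http://www.mywebsite.com/about', $whitelist));         // bool(true)
var_dump(hostIsWhitelisted('http://notmywebsite.com/', $whitelist));               // bool(false)
```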

Pedro Lobito answered Oct 26 '22 03:10