
How to add rel="nofollow" to links with preg_replace()

The function below is meant to add rel="nofollow" to every external link, and to leave internal links alone unless their URL starts with a predefined root URL, stored in $my_folder below.

So given the variables...

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

And the content...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com">external</a>

The end result, after replacement, should be...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator" rel="nofollow">internal cloaked link</a>

<a href="http://cnn.com" rel="nofollow">external</a>

Notice that the first link is not altered, since it's an internal link.

The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.

The third link is the easiest: since it does not match $blog_url, it's obviously an external link.

However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?

function save_rseo_nofollow($content) {
$my_folder =  $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
    preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
    for ( $i = 0; $i <= sizeof($matches[0]); $i++){
        if ( !preg_match( '~nofollow~is',$matches[0][$i])
            && (preg_match('~' . $my_folder . '~', $matches[0][$i]) 
               || !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
            $result = trim($matches[0][$i],">");
            $result .= ' rel="nofollow">';
            $content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
        }
    }
    return $content;
}
Scott B asked Feb 18 '11 04:02


4 Answers

Thanks @alex for your nice solution. However, I had a problem with Japanese text, which I fixed as follows. This version can also skip multiple domains via the $whiteList array.

public function addRelNoFollow($html, $whiteList = [])
{
    $dom = new \DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $a = $dom->getElementsByTagName('a');

    /** @var \DOMElement $anchor */
    foreach ($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;
        $domain = parse_url($href, PHP_URL_HOST);

        // Skip whiteList domains
        if (in_array($domain, $whiteList, true)) {
            continue;
        }

        // Check & get existing rel attribute values
        $noFollow = 'nofollow';
        $rel = $anchor->attributes->getNamedItem('rel');
        if ($rel) {
            $values = explode(' ', $rel->nodeValue);
            if (in_array($noFollow, $values, true)) {
                continue;
            }
            $values[] = $noFollow;
            $newValue = implode(' ', $values); // separator first; the reversed order was removed in PHP 8
        } else {
            $newValue = $noFollow;
        }

        // Create new rel attribute
        $rel = $dom->createAttribute('rel');
        $node = $dom->createTextNode($newValue);
        $rel->appendChild($node);
        $anchor->appendChild($rel);
    }

    // There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
    // They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
    // So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
    return $dom->saveHTML($dom->documentElement);
}
biplob answered Sep 17 '22 04:09


Try to make it more readable first, and only afterwards make your if rules more complex:

function save_rseo_nofollow($content) {
    $content["post_content"] =
    preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
    return $content;
}

function cb2($match) { 
    list($original, $tag) = $match;   // regex match groups

    $my_folder =  "/hostgator";       // re-add quirky config here
    $blog_url = "http://localhost/";

    if (strpos($tag, "nofollow") !== false) {
        return $original;             // already has nofollow
    }
    elseif (strpos($tag, $blog_url) !== false
            && (!$my_folder || strpos($tag, $my_folder) === false)) {
        return $original;             // internal and not under the cloaked folder
    }
    else {
        return "<$tag rel='nofollow'>";
    }
}

This gives the following output:

[post_content] =>
  <a href="http://localhost/mytest/">internal</a>
  <a href="http://localhost/mytest/go/hostgator" rel='nofollow'>internal cloaked link</a>
  <a href="http://cnn.com" rel='nofollow'>external</a>

The problem in your original code may have been $rseo, which isn't declared anywhere inside the function.
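To illustrate why an undefined $rseo would tag every link: if $my_folder ends up empty, the pattern built from it is '~~', and an empty pattern matches any string. The sketch below assumes that this is what happens in the original function; the sample tags come from the question, and the variable names are only for illustration. It also applies preg_quote(), which is worth doing whenever a URL is spliced into a pattern.

```php
<?php
// Assumption: $rseo is not visible inside the function, so the lookup yields null.
$my_folder = null;

$tags = [
    '<a href="http://localhost/mytest/">',
    '<a href="http://localhost/mytest/go/hostgator">',
    '<a href="http://cnn.com">',
];

// '~' . null . '~' is the empty pattern '~~', which matches every string.
$empty_pattern = '~' . $my_folder . '~';
$matched_by_empty = [];
foreach ($tags as $tag) {
    $matched_by_empty[] = preg_match($empty_pattern, $tag);
}

// With the folder actually set (and quoted), only the cloaked link matches.
$my_folder = 'http://localhost/mytest/go/';
$folder_pattern = '~' . preg_quote($my_folder, '~') . '~';
$matched_by_folder = [];
foreach ($tags as $tag) {
    $matched_by_folder[] = preg_match($folder_pattern, $tag);
}

print_r($matched_by_empty);   // all three entries are 1
print_r($matched_by_folder);  // only the second entry is 1
```

If this is indeed the cause, making $rseo available inside the function (or passing the folder in as a parameter) should stop the first condition from matching every link.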

mario answered Sep 18 '22 04:09


Here is the DOMDocument solution...

$str = '<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me">external</a>

<a href="http://google.com">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel">external</a>
';
$dom = new DOMDocument();

$dom->preserveWhiteSpace = FALSE; // property names are case-sensitive

$dom->loadHTML($str);

$a = $dom->getElementsByTagName('a');

$host = strtok($_SERVER['HTTP_HOST'], ':');

foreach($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;

        if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
           continue;
        }

        $noFollowRel = 'nofollow';
        $oldRelAtt = $anchor->attributes->getNamedItem('rel');

        if ($oldRelAtt === NULL) {
            $newRel = $noFollowRel;
        } else {
            $oldRel = $oldRelAtt->nodeValue;
            $oldRel = explode(' ', $oldRel);
            if (in_array($noFollowRel, $oldRel)) {
                continue;
            }
            $oldRel[] = $noFollowRel;
            $newRel = implode(' ', $oldRel); // separator first; the reversed order was removed in PHP 8
        }

        $newRelAtt = $dom->createAttribute('rel');
        $noFollowNode = $dom->createTextNode($newRel);
        $newRelAtt->appendChild($noFollowNode);
        $anchor->appendChild($newRelAtt);

}

var_dump($dom->saveHTML());

Output

string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me nofollow">external</a>

<a href="http://google.com" rel="nofollow">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel nofollow">external</a>
</body></html>
"
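Note that the output above leaves the cloaked /go/ link untouched, since this code only compares hosts. As a rough sketch of how the question's $my_folder rule could be layered on top of the DOMDocument approach: the helper names needs_nofollow() and add_nofollow() are hypothetical, relative and non-HTTP links are assumed to be internal, and prefix matching on the URL is assumed to be good enough. Character encoding is not handled here.

```php
<?php
// Hypothetical helper (not from the original answers): applies the question's three rules.
function needs_nofollow(string $href, string $blogUrl, string $cloakFolder): bool
{
    // Assumption: relative links, anchors, mailto: etc. count as internal.
    if (!preg_match('~^https?://~i', $href)) {
        return false;
    }
    // Cloaked links under $cloakFolder get nofollow even though they are internal.
    if ($cloakFolder !== '' && strpos($href, $cloakFolder) === 0) {
        return true;
    }
    // Everything else is external unless it starts with the blog URL.
    return strpos($href, $blogUrl) !== 0;
}

function add_nofollow(string $html, string $blogUrl, string $cloakFolder): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if (!needs_nofollow($anchor->getAttribute('href'), $blogUrl, $cloakFolder)) {
            continue;
        }
        // Keep existing rel tokens and append nofollow only if it is missing.
        $rel = preg_split('/\s+/', $anchor->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
            $anchor->setAttribute('rel', implode(' ', $rel));
        }
    }

    // Serialize only the children of <body>, dropping the doctype/html wrapper.
    $out  = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$html = '<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com" rel="me">external</a>';

$result = add_nofollow($html, 'http://localhost/mytest', 'http://localhost/mytest/go/');
echo $result;
```

Using getAttribute() and setAttribute() avoids dereferencing a missing href or rel node, and serializing only the body's children keeps the doctype and html wrapper out of the saved post content.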
alex answered Sep 20 '22 04:09


Try this one (PHP 5.3+). It:

  • skips links that contain a given address
  • leaves links with a manually set rel attribute untouched

The code:

function nofollow($html, $skip = null) {
    return preg_replace_callback(
        "#(<a[^>]+?)>#is", function ($match) use ($skip) {
            return (
                !($skip && strpos($match[1], $skip) !== false) &&
                strpos($match[1], 'rel=') === false
            ) ? $match[1] . ' rel="nofollow">' : $match[0];
        },
        $html
    );
}

Examples:

echo nofollow('<a href="link somewhere" rel="something">something</a>');
// unchanged, because the anchor already has a rel attribute

echo nofollow('<a href="http://www.cnn.com">something</a>');
// adds rel="nofollow" to the anchor

echo nofollow('<a href="http://localhost">something</a>', 'localhost');
// unchanged, because the link is skipped as an internal link
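One caveat with the substring check: $skip is matched anywhere in the tag, so an external URL that merely mentions the address (in a query string, say) is skipped as well. A minimal sketch of a stricter test, comparing only the host via parse_url(), might look like the following; is_internal_host() is a hypothetical helper, not part of the function above.

```php
<?php
// Hypothetical helper: compares the link's host instead of searching the whole tag.
function is_internal_host(string $href, string $host): bool
{
    $linkHost = parse_url($href, PHP_URL_HOST);   // not a string when there is no host
    return is_string($linkHost) && strcasecmp($linkHost, $host) === 0;
}

// The substring test treats this external link as internal...
$substring_hit = strpos('<a href="http://example.com/?ref=localhost"', 'localhost') !== false;
// ...while the host comparison does not.
$host_hit = is_internal_host('http://example.com/?ref=localhost', 'localhost');
$internal = is_internal_host('http://localhost/mytest/', 'localhost');

var_dump($substring_hit); // bool(true)
var_dump($host_hit);      // bool(false)
var_dump($internal);      // bool(true)
```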
OzzyCzech answered Sep 20 '22 04:09