Grabbing the href attribute of an A element

Tags: html, dom, php

I'm trying to find the links on a page.

My regex is:

/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/

but it seems to fail on

<a title="this" href="that">what?</a>

How would I change my regex to deal with href not placed first in the a tag?

bergin asked Sep 29 '10 10:09



8 Answers

Reliable regexes for HTML are difficult to write. Here is how to do it with DOM:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHtml($node), PHP_EOL;
}

The above would find and output the "outerHTML" of all A elements in the $html string.

To get all the text values of the node, you do

echo $node->nodeValue; 

To check if the href attribute exists you can do

echo $node->hasAttribute( 'href' );

To get the href attribute you'd do

echo $node->getAttribute( 'href' );

To change the href attribute you'd do

$node->setAttribute('href', 'something else');

To remove the href attribute you'd do

$node->removeAttribute('href'); 
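
Putting the calls above together, a minimal sketch that collects every href on a page might look like the following. The helper name extractHrefs and the sample markup are assumptions made for illustration; libxml_use_internal_errors() is only there to keep the parser quiet about the sloppy markup that real pages tend to contain.

```php
<?php
// Hypothetical helper: the name and signature are assumptions, not part of the answer.
function extractHrefs(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        // Skip anchors such as <a name="..."> that carry no href.
        if ($node->hasAttribute('href')) {
            $hrefs[] = $node->getAttribute('href');
        }
    }
    return $hrefs;
}

// Illustrative input: the link from the question plus an anchor without href.
$html = '<p><a title="this" href="that">what?</a> <a name="anchor">no link</a></p>';
print_r(extractHrefs($html));
```

Because the parser reads attributes by name, the position of href inside the tag does not matter here.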

You can also query for the href attribute directly with XPath

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    echo $href->nodeValue;                       // echo current attribute value
    $href->nodeValue = 'new value';              // set new attribute value
    $href->parentNode->removeAttribute('href');  // remove attribute
}
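
As a read-only variation on the XPath loop above, the sketch below selects only the A elements that actually have an href and maps each href to its link text. The sample markup and the $links variable are assumptions for illustration.

```php
<?php
// Illustrative input (an assumption): two links with different quoting styles.
$html = '<a title="this" href="that">what?</a><a href=\'other\'>more</a>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$links = [];
// //a[@href] selects A elements that have an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}
print_r($links);
```

Selecting the element rather than the attribute node keeps both the href and the link text within reach in a single pass.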

Also see:

  • Best methods to parse HTML
  • DOMDocument in php

On a side note: I am sure this is a duplicate, and you can find the answer somewhere in here

Gordon answered Sep 30 '22 15:09


I agree with Gordon: you MUST use an HTML parser to parse HTML. But if you really want a regex, you can try this one:

/^<a.*?href=(["\'])(.*?)\1.*$/

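This matches <a at the beginning of the string, followed by any number of any character (non-greedy, .*?), then href= followed by the link surrounded by either " or '. The \1 backreference requires the closing quote to be the same character as the opening one.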

$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);

Output:

array(3) {
  [0]=>
  string(37) "<a title="this" href="that">what?</a>"
  [1]=>
  string(1) """
  [2]=>
  string(4) "that"
}
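
Because of the ^ and $ anchors, the pattern above is meant for a string that holds a single tag. For a larger fragment, one possible variation (a sketch under assumptions, and still not a substitute for a parser) drops the anchors and uses preg_match_all. The sample markup and variable names are made up for illustration.

```php
<?php
// Illustrative input (an assumption): two links inside a larger fragment.
$html = '<p><a title="this" href="that">what?</a> and '
      . '<a href=\'other\' class="x">more</a></p>';

// Same idea as the pattern above, minus the ^ and $ anchors.
// [^>]*? keeps the search for href inside the opening tag.
$pattern = '/<a\s[^>]*?href=(["\'])(.*?)\1/i';

preg_match_all($pattern, $html, $m);

// The second capture group holds the attribute values.
$hrefs = $m[2];
print_r($hrefs);
```

This remains fragile: for example, it would also pick up an attribute whose name merely ends in href, and it ignores unquoted values.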
Toto answered Sep 30 '22 14:09


The pattern you want to look for is the link anchor pattern, something like:

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
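
One caveat worth checking: this pattern expects href to come directly after <a. A small sketch, using the markup from the question plus a made-up second sample, suggests it matches only when href is the first attribute. The variable names other than $regex_pattern are assumptions.

```php
<?php
// Pattern from the answer above, unchanged.
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

// Markup from the question: href is not the first attribute.
$hrefSecond = '<a title="this" href="that">what?</a>';
// Illustrative sample (an assumption): href is the only attribute.
$hrefFirst = '<a href="that">what?</a>';

$matchedSecond = preg_match($regex_pattern, $hrefSecond, $m1);
$matchedFirst  = preg_match($regex_pattern, $hrefFirst, $m2);

var_dump($matchedSecond);            // int(0): no match when title comes first
var_dump($matchedFirst, $m2[1]);     // int(1) and string(4) "that"
```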
Alex Pliutau answered Sep 30 '22 15:09


Why don't you just match

"<a.*?href\s*=\s*['"](.*?)['"]"

<?php

$str = '<a title="this" href="that">what?</a>';

$res = array();

preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);

var_dump($res);

?>

then

$ php test.php
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(27) "<a title="this" href="that""
  }
  [1]=>
  array(1) {
    [0]=>
    string(4) "that"
  }
}

This works. I've just removed the first capture group.

Aif answered Sep 30 '22 14:09


For anyone who still hasn't found a solution, this is very easy and fast using SimpleXML:

$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com

It's working for me.
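
SimpleXMLElement expects well-formed XML, so its constructor can throw on typical HTML (an unclosed <br>, for instance). One possible workaround, sketched here with made-up sample markup, is to let DOMDocument parse the HTML first and hand the result to simplexml_import_dom().

```php
<?php
// Illustrative input (an assumption): HTML that is not well-formed XML.
$html = '<p>Unclosed paragraph <br> <a title="this" href="that">what?</a>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Convert the parsed DOM tree into a SimpleXMLElement.
$xml = simplexml_import_dom($dom);

$hrefs = [];
foreach ($xml->xpath('//a[@href]') as $a) {
    // Attribute access returns an object, so cast it to string.
    $hrefs[] = (string) $a['href'];
}
print_r($hrefs);
```

This keeps the short $a['href'] syntax while leaving the tolerant parsing to DOMDocument.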

Milan Malani answered Sep 30 '22 16:09


Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a> seems to do the trick, with the 1st match being " or ', the second the 'href' value 'that', and the third the 'what?'.

The reason I left the first match of "/' in there is that you can use it to backreference it later for the closing "/' so it's the same.

See live example on: http://www.rubular.com/r/jsKyK2b6do

CharlesLeaf answered Sep 30 '22 16:09


I'm not sure what you're trying to do here, but if you're trying to validate the link then look at PHP's filter_var()
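
If validation is the goal, a minimal sketch with filter_var() and the FILTER_VALIDATE_URL filter could look like this. The candidate values are made up for illustration; note that this filter only accepts URLs with a scheme, so relative hrefs are rejected.

```php
<?php
// Illustrative href values (assumptions), as they might come out of a page.
$candidates = ['https://example.com/page?id=1', 'that', 'www.something.com'];

$results = [];
foreach ($candidates as $href) {
    // filter_var() returns the value on success and false on failure.
    $results[$href] = filter_var($href, FILTER_VALIDATE_URL) !== false;
}
var_dump($results);
```

Only the first candidate passes here, since the other two lack a scheme such as https://.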

If you really need to use a regular expression then check out this tool, it may help: http://regex.larsolavtorvik.com/

Adam answered Sep 30 '22 15:09


Using your regex, I modified it a bit to suit your need.

<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>

Personally, I suggest you use an HTML parser.

EDIT: Tested

Ruel answered Sep 30 '22 14:09