
How can I extract all anchor tags, their hrefs and their anchor text within a string? [duplicate]

I need to process links within an HTML string in several different ways.

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>.';
$links = extractLinks($str);
foreach ($links as $link) {
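    // The "e" modifier only ever applied to preg_replace and is rejected by PHP 7+, so it is dropped here.
    $pattern = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    // Test the individual link's href rather than the whole string.
    if (preg_match($pattern, $link['href'])) {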
        // Process Remote links
        //   For example, replace url with short url,
        //   or replace long anchor text with truncated
    } else {
        // Process Local Links, Anchors

    }
}
function extractLinks($str) {
    // First, I tried DomDocument
    $dom = new DomDocument();
    $dom->loadHTML($str);
    return $dom->getElementsByTagName('a');
    // But this just returns:
    //   DOMNodeList Object
    //   (
    //       [length] => 3
    //   )

    // Then I tried Regex
    if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $str, $matches)) {
        print_r($matches);
    }
    // But this didn't work either.
}

Desired result of extractLinks($str):

[0] => Array(
           'str' => '<a href="http://example.com/abc" rel="link">string</a>',
           'href' => 'http://example.com/abc',
           'anchorText' => 'string'
       ),
[1] => Array(
           'str' => '<a href="/local/path" title="with attributes">number</a>',
           'href' => '/local/path',
           'anchorText' => 'number'
       ),
[2] => Array(
           'str' => '<a href="#anchor" data-attr="lots">links</a>',
           'href' => '#anchor',
           'anchorText' => 'links'
       )

I need all of these so I can do things like edit the href (add tracking, shorten, etc.), or replace the whole tag with something else (<a href="/u/username">username</a> could become username).

Here's a demo of what I'm trying to do.

Ryan asked May 07 '14 20:05


2 Answers

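DOMDocument does work here. getElementsByTagName returns a DOMNodeList, so you just need to loop over it and build the array yourself: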

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

$dom = new DomDocument();
$dom->loadHTML($str);
$output = array();
foreach ($dom->getElementsByTagName('a') as $item) {
   $output[] = array (
      'str' => $dom->saveHTML($item),
      'href' => $item->getAttribute('href'),
      'anchorText' => $item->nodeValue
   );
}

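By iterating over the node list and calling getAttribute('href'), nodeValue and saveHTML($item) on each node, you get the href, the anchor text and the full tag markup, which is exactly the output you want. Note that passing a node to saveHTML requires PHP 5.3.6 or later.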
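As a minimal sketch, the loop above can be wrapped as the extractLinks() function from the question and combined with a remote/local check. The helper isRemoteLink() and the use of libxml_use_internal_errors() to silence parser warnings on fragment input are assumptions added for illustration, not part of the original answer.

```php
<?php
// Sketch: the DOM loop wrapped as extractLinks(), as named in the question.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Assumption: collect libxml warnings internally, since the input is an HTML fragment.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [
            'str'        => $dom->saveHTML($a),       // markup of the whole <a> element
            'href'       => $a->getAttribute('href'), // empty string if the attribute is missing
            'anchorText' => $a->nodeValue,            // text content of the element
        ];
    }
    return $links;
}

// Hypothetical helper: an href counts as remote if it has an http, https or ftp scheme.
function isRemoteLink(string $href): bool
{
    $scheme = parse_url($href, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https', 'ftp'], true);
}

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

foreach (extractLinks($str) as $link) {
    echo isRemoteLink($link['href']) ? 'remote' : 'local', ': ', $link['href'], PHP_EOL;
}
```

Checking the scheme with parse_url avoids running a URL regex against each href, though whether that matches your definition of "remote" (for example, protocol-relative //host/path links) is something you would need to decide.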

Javad answered Sep 22 '22 15:09


You can also do it with a regular expression, like this:

<a\s*href="([^"]+)"[^>]+>([^<]+)</a>
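  1. The overall match is the full tag: element 0 of the result
  2. The Group 1 capture is the href: element 1
  3. The Group 2 capture is the anchor text: element 2

Use preg_match($pattern,$string,$m) to get the first link only.

The array elements will be in $m[0], $m[1] and $m[2]. To collect every link, use preg_match_all as in the demo below.

Note that this pattern expects href to be the first attribute, double-quoted, and followed by at least one more attribute, and it does not handle tags nested inside the anchor text.

Working PHP demo: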

$string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>. ';
$regex='|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';
$howmany = preg_match_all($regex,$string,$res,PREG_SET_ORDER);
print_r($res);
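
With PREG_SET_ORDER, each entry of $res is one link, so it maps directly onto the array shape asked for in the question. The sketch below shows that mapping. The function name extractLinksRegex() and the slightly more tolerant pattern (href anywhere in the tag, and allowed to be the only attribute) are assumptions for illustration. It still requires double-quoted hrefs and plain-text anchor content.

```php
<?php
// Sketch: regex-based extraction reshaped into the structure from the question.
function extractLinksRegex(string $html): array
{
    // Hypothetical variant of the pattern above:
    //   <a\s              opening tag followed by whitespace
    //   (?:[^>]*?\s)?     optionally, other attributes before href
    //   href="([^"]*)"    group 1: the double-quoted href value
    //   [^>]*>            any remaining attributes, then the end of the opening tag
    //   ([^<]*)</a>       group 2: anchor text without nested tags
    $regex = '~<a\s(?:[^>]*?\s)?href="([^"]*)"[^>]*>([^<]*)</a>~i';

    preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = [
            'str'        => $m[0], // the whole matched tag
            'href'       => $m[1],
            'anchorText' => $m[2],
        ];
    }
    return $links;
}

$string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>. ';

print_r(extractLinksRegex($string));
```

Because 'str' holds the exact matched text, it can be passed to str_replace to swap a whole tag for something else, which a regex approach makes straightforward as long as the markup stays this regular.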

Hans Schindler answered Sep 18 '22 15:09