regexp for finding everything between <a> and </a> tags

Tags: regex, php

I'm trying to make a list of everything between <a> and </a> tags. I have a list of links and I want to get the names of the links (not where the links go, but what they're called on the page). Any help would be appreciated.

Currently I have this:

$lines = preg_split("/\r?\n|\r/", $content);  // content is the given page
foreach ($lines as $val) {
  if (preg_match("/(<A(.*)>)(<\/A>)/", $val, $alink)) {     
    $newurl = $alink[1];

    // put in array of found links
    $links[$index] = $newurl;
    $index++;
    $is_href = true;
  }
}
Vikram Haer asked Dec 05 '08 07:12



4 Answers

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
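
In PHP the pattern needs delimiters and is typically run with preg_match_all(). The following is a minimal sketch, assuming the page is already in a string. The sample markup, the `~` delimiter, the `i` and `s` flags, and the variable names are illustrative choices, not part of the answer above.

```php
<?php
// Sample input (an assumption for illustration): two links, one written in
// upper case and one whose text spans a line break.
$content = '<p><a href="/one">First link</a> and <A HREF="/two">Second
link</A></p>';

// "~" is used as the delimiter so "</a>" needs no escaping.
// i = case-insensitive (also matches <A>), s = "." also matches newlines.
preg_match_all('~<a\b[^>]*>(.*?)</a>~is', $content, $matches);

// Group 1 holds the text between the tags. strip_tags() drops nested markup
// such as <b>, and trim() removes surrounding whitespace.
$linkTexts = array_map(
    static fn (string $text): string => trim(strip_tags($text)),
    $matches[1]
);

print_r($linkTexts);
```

Because the quantifier is lazy (`.*?`), each link is captured separately even when several sit on one line. The disclaimer still holds: a `>` inside an attribute value would end `[^>]*` too early.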
Tomalak answered Oct 05 '22 17:10


I'm a big fan of regexes, but this is not the right place to use them.

Use a real HTML parser.

  • Your code will be clearer
  • It will be more likely to work

I Googled for a PHP HTML parser, and found this one.

If you know you're working with XHTML, then you could use PHP's standard XML parser.
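
As a rough sketch of the XHTML route, the snippet below uses DOMDocument::loadXML(), one of PHP's bundled XML APIs. Which parser the answer had in mind is not stated, so this choice is an assumption, as are the sample markup and variable names.

```php
<?php
// Sample input (an assumption): a well-formed XHTML fragment.
// An XML parser rejects markup that is not well-formed.
$xhtml = <<<'XML'
<div>
  <p><a href="/one">First link</a></p>
  <p><a href="/two">Second <b>link</b></a></p>
</div>
XML;

$doc = new DOMDocument();
if (!$doc->loadXML($xhtml)) {
    throw new RuntimeException('Input is not well-formed XML');
}

$linkTexts = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // textContent includes the text of nested elements such as <b>.
    $linkTexts[] = $a->textContent;
}

print_r($linkTexts);
```

This only works when the input really is well-formed. For ordinary HTML, loadHTML() is the more forgiving entry point.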

slim answered Oct 05 '22 17:10


<a\s*(.*)\>(.*)</a>

<a href="http://www.stackoverflow.com">Go to stackoverflow.com</a>

$1 = href="http://www.stackoverflow.com"

$2 = Go to stackoverflow.com
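
A quick way to check this pattern in PHP, and to see where it stops working, is sketched below. The `~` delimiter, the two-link sample line, and the lazy variant are assumptions added for illustration; they are not part of the answer.

```php
<?php
$pattern = '~<a\s*(.*)\>(.*)</a>~';

// One link on the line: group 1 is the attribute string, group 2 the text.
$one = '<a href="http://www.stackoverflow.com">Go to stackoverflow.com</a>';
preg_match($pattern, $one, $m);

// Two links on one line (sample input, an assumption): the greedy (.*)
// runs across the first link, so only the last link text is isolated.
$two = '<a href="/a">One</a> <a href="/b">Two</a>';
preg_match($pattern, $two, $greedy);

// Lazy quantifiers (.*?) stop at the first possible ">" and "</a>",
// so each link is matched on its own.
preg_match_all('~<a\s*(.*?)\>(.*?)</a>~', $two, $lazy);

var_export([$m[1], $m[2], $greedy[1], $greedy[2], $lazy[2]]);
```

With several links on one line, the greedy version folds everything up to the last link into group 1, which is why the lazy form is usually the safer choice.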

I answered a similar question to strip everything except a tags here

Xetius answered Oct 05 '22 18:10


If I am going to complain about all of the regex solutions, I suppose I need to actually demonstrate how to use a proper HTML parser. The OP gives no indication that the HTML to be parsed is in any way invalid, so a legitimate parser is absolutely appropriate for script stability and quality.

Now, my advice does require that you become familiar with the basics of DOMDocument (and optionally DOMXPath), but once you understand the components involved, you will see that the syntax is far less cryptic than a regular expression. For this reason, I will also argue that this technique will improve the overall readability of your script (for you and future readers of your code).

Code: (Demos)

$html = <<<HTML
<a href="#">hello</a> <abbr href="#">FYI</abbr> <a title="goodbye">later</a>
<a href=https://example.com>no quoted attributes</a>
<A href="https://example.com"
title="some title"
data-key="{\'key\':\'adf0a8dfq<>*1$4%\'">a link with data attribute</A>
and
this is <a title="hello">not a hyperlink</a> but simply an anchor tag
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$linkText = [];
// select only the <a> elements that carry an href attribute
foreach ($xpath->evaluate("//a[@href]") as $node) {
    $linkText[] = $node->nodeValue;  // text content of the element
}
var_export($linkText);

Output:

array (
  0 => 'hello',
  1 => 'no quoted attributes',
  2 => 'a link with data attribute',
)    

If you don't care whether the href attribute exists:

Code:

$doc = new DOMDocument();
$doc->loadHTML($html);
$aTags = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $aTags[] = $a->nodeValue;
}
var_export($aTags);

Output:

array (
  0 => 'hello',
  1 => 'later',
  2 => 'no quoted attributes',
  3 => 'a link with data attribute',
  4 => 'not a hyperlink',
)
mickmackusa answered Oct 05 '22 17:10