I'm trying to find a way to make a list of everything between <a> and </a> tags. So I have a list of links and I want to get the names of the links (not where the links go, but what they're called on the page). Would be really helpful to me.
Currently I have this:
$lines = preg_split("/\r?\n|\r/", $content); // content is the given page
foreach ($lines as $val) {
    if (preg_match("/(<A(.*)>)(<\/A>)/", $val, $alink)) {
        $newurl = $alink[1];
        // put in array of found links
        $links[$index] = $newurl;
        $index++;
        $is_href = true;
    }
}
The preg_match() function can extract text between HTML tags with a regular expression in PHP. It stops at the first match, so use preg_match_all() when you need every occurrence. You can also extract the content of an element based on its class name or ID.
Combine . with an * (asterisk), as in .*, and it will match any run of characters (line breaks excluded unless the s flag is set). \s (whitespace metacharacter) will match any whitespace character (space, tab, line break, ...), and \S (the opposite of \s) will match anything that is not a whitespace character.
Regex: Extracting text between two HTML tags.
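Below is a simple regex that matches a single HTML tag, including quoted attribute values. It can be used to remove all tags and leave only the text. /<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g (the g flag is JavaScript syntax; in PHP, preg_replace() and preg_match_all() already process every match).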
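A rough sketch of how that tag pattern might be applied in PHP to strip tags and keep the text. The sample markup and the variable names ($html, $tagPattern, $text) are made up for illustration, and ~ is used as the delimiter instead of /; treat this as a starting point rather than a robust sanitizer.

```php
<?php
// Illustrative input; note the > inside the double-quoted attribute value.
$html = '<p>Visit <a href="/a>b" title=\'x\'>the docs</a> today.</p>';

// A nowdoc keeps the pattern verbatim, so its quotes need no escaping.
// PHP has no g flag: preg_replace() already replaces every match.
$tagPattern = <<<'REGEX'
~<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>~
REGEX;

// Replace each tag with an empty string, leaving only the text nodes.
$text = preg_replace($tagPattern, '', $html);

var_export($text);
```

For plain tag removal, PHP's built-in strip_tags() is usually the simpler choice; the regex version is shown only because it is the pattern under discussion.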
The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.
Having said that:
<a\b[^>]*>(.*?)</a> // match group one will contain the link text
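As a minimal sketch of how that pattern could be used from PHP, assuming the input is small and reasonably well-formed: preg_match_all() collects every match rather than just the first. The i and s flags, the ~ delimiter, the sample $html, and the strip_tags()/trim() clean-up are my additions for illustration, not part of the pattern above.

```php
<?php
// Illustrative sample input; not taken from the question.
$html = '<p><a href="/one">First</a> and <A HREF="/two">Second
link</A></p>';

// ~ is the delimiter, so the / in </a> needs no escaping.
// i: match <A> as well as <a>; s: let . match line breaks inside the link text.
preg_match_all('~<a\b[^>]*>(.*?)</a>~is', $html, $matches);

// Group 1 holds the text between the tags; drop nested markup and trim it.
$linkText = array_map(
    fn (string $inner): string => trim(strip_tags($inner)),
    $matches[1]
);

var_export($linkText);
```

Because the whole document is matched at once, there is no need to split the page into lines first, and a link whose text wraps onto a second line is still found.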
I'm a big fan of regexes, but this is not the right place to use them.
Use a real HTML parser.
I Googled for a PHP HTML parser, and found this one.
If you know you're working with XHTML, then you could use PHP's standard XML parser.
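For the XHTML case, something like the following might work. It is a sketch that uses SimpleXML, one of the XML extensions bundled with PHP, as the "standard XML parser"; the sample markup and variable names are assumptions. A real XHTML document that declares the XHTML default namespace would also need that namespace registered (for example with registerXPathNamespace()) before the XPath query matches anything.

```php
<?php
// Illustrative, well-formed XHTML fragment without a namespace declaration.
$xhtml = '<div><a href="/one">First</a> <a name="top">Anchor only</a> <a href="/two">Second</a></div>';

// simplexml_load_string() returns false when the input is not well-formed XML.
$xml = simplexml_load_string($xhtml);
if ($xml === false) {
    exit("Input is not well-formed XML\n");
}

$linkText = [];
// XML is case-sensitive, so this only matches lowercase <a> elements with an href.
foreach ($xml->xpath('//a[@href]') as $a) {
    // Casting to string yields the element's own text, not text of nested tags.
    $linkText[] = trim((string) $a);
}

var_export($linkText);
```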
<a\s+([^>]*)>(.*?)</a>
<a href="http://www.stackoverflow.com">Go to stackoverflow.com</a>
$1 = href="http://www.stackoverflow.com"
$2 = Go to stackoverflow.com
I answered a similar question to strip everything except a tags here
If I am going to complain about all of the regex solutions, I suppose I need to actually demonstrate how to use a proper HTML parser (the OP makes no indication that the HTML to be parsed is in any way invalid -- so a legitimate parser is absolutely appropriate for script stability and quality).
Now, my advice does require that you become familiar with the basics of DOMDocument (and optionally DOMXPath), but you will see that the syntax is far less cryptic than a regular expression once you understand the components involved. For this reason, I will also argue that this technique will improve the overall readability of your script (for you and future readers of your code).
Code:
$html = <<<HTML
<a href="#">hello</a> <abbr href="#">FYI</abbr> <a title="goodbye">later</a>
<a href=https://example.com>no quoted attributes</a>
<A href="https://example.com"
title="some title"
data-key="{\'key\':\'adf0a8dfq<>*1$4%\'">a link with data attribute</A>
and
this is <a title="hello">not a hyperlink</a> but simply an anchor tag
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$linkText = [];
foreach ($xpath->evaluate("//a[@href]") as $node) {
    $linkText[] = $node->nodeValue;
}
var_export($linkText);
Output:
array (
  0 => 'hello',
  1 => 'no quoted attributes',
  2 => 'a link with data attribute',
)
If you don't care whether the href attribute exists:
Code:
$doc = new DOMDocument();
$doc->loadHTML($html);
$aTags = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $aTags[] = $a->nodeValue;
}
var_export($aTags);
Output:
array (
  0 => 'hello',
  1 => 'later',
  2 => 'no quoted attributes',
  3 => 'a link with data attribute',
  4 => 'not a hyperlink',
)