I would like to extract all the URLs and link titles from a paragraph of text.
Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.
I am able to get all the href values with the following regex, but I don't know how to also get the title between the <a> and </a> tags.
preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls);
Ideally I would get an associative array like this:
[0] => Array
(
[title] => XXX
[link] => http://test.com/blop
)
[1] => Array
(
[title] => XXX
[link] => http://test.com
)
Thanks for your help
If you still insist on using a regex to solve this problem, you might be able to parse some of the links with this pattern:
<a.*?href="(.*?)".*?>(.*?)</a>
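Note that it doesn't use the U modifier as yours did: U inverts greediness, so the lazy quantifiers (.*?) in this pattern would become greedy under it.
Update: To have it accept single quotes as well as double quotes, you can use the following pattern instead. The URL then ends up in group 1 (double quotes) or group 2 (single quotes), and the title in group 3: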
<a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a>
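To get from that pattern to the array shape asked for in the question, something along these lines might work. This is only a sketch: extractLinksWithRegex() and $message are names made up for illustration, and the pattern will still break on markup it doesn't anticipate (unquoted attributes, nested tags, line breaks inside a tag).

```php
<?php
// Sketch only: extractLinksWithRegex() is a hypothetical helper, not from the question.
function extractLinksWithRegex(string $html): array
{
    // Same pattern as above; "~" is the delimiter so the "/" in "</a>" needs no escaping.
    $pattern = '~<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>~i';

    // PREG_SET_ORDER yields one array per matched link instead of one array per group.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = [
            'title' => $m[3],
            // Group 1 holds a double-quoted URL, group 2 a single-quoted one;
            // the group that did not take part in the match is an empty string.
            'link'  => $m[1] !== '' ? $m[1] : $m[2],
        ];
    }
    return $links;
}

// $message stands in for $v['message'] from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';
print_r(extractLinksWithRegex($message));
```

The title comes back as the raw inner HTML of the link, so any nested tags or entities would still need cleaning up.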
As has been mentioned in the comments, don't use a regular expression for this; use a DOM parser instead.
E.g.
<?php
$doc = new DOMDocument;
$doc->loadHTML( getExampleData() );

// Select only the links inside the paragraph with id="abc".
$xpath = new DOMXPath($doc);
foreach( $xpath->query('/html/body/p[@id="abc"]//a') as $node ) {
    echo $node->getAttribute('href'), ' - ' , $node->textContent, "\n";
}

// Sample document: the first paragraph's link must not be matched.
function getExampleData() {
    return '<html><head><title>...</title></head><body>
        <p>
            not <a href="wrong">this one</a> but ....
        </p>
        <p id="abc">
            Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.
        </p>
    </body></html>';
}
See http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath
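If you want the DOM approach to return the associative array from the question rather than echo each link, a minimal sketch could look like the following. extractLinks() is a hypothetical helper name, and the '<?xml encoding="UTF-8">' prefix is a commonly used workaround (an assumption about your input being UTF-8) so that accented characters are not misread by loadHTML().

```php
<?php
// Sketch only: extractLinks() is a hypothetical helper, not part of the DOM extension.
function extractLinks(string $html): array
{
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: the input is UTF-8; the prefix hints that encoding to the parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Only <a> elements that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = [
            'title' => $node->textContent,
            'link'  => $node->getAttribute('href'),
        ];
    }
    return $links;
}

// $message stands in for $v['message'] from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';
print_r(extractLinks($message));
```

Unlike the regex, this does not care about quote style or attribute order, and textContent returns the link text with any nested tags already stripped.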