Regex to find all URLs and titles

Tags: regex, url, php

I would like to extract all the URLs and link titles from a paragraph of text.

Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.

I can get all the href values with the following regex, but I don't know how to also get the title between the <a> and </a> tags.

preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls);

Ideally, I would get an array of associative arrays like this:

[0] => Array
(
   [title] => XXX
   [link] => http://test.com/blop
)
[1] => Array
(
   [title] => XXX
   [link] => http://test.com
)

Thanks for your help

Simon Taisne asked Jan 18 '23 10:01


2 Answers

If you still insist on using a regex for this, the following pattern may be enough for simple markup like your example:

<a.*?href="(.*?)".*?>(.*?)</a>

Note that it doesn't use the U modifier as yours did. U inverts greediness, so under it the lazy .*? quantifiers would become greedy.
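To illustrate why the U modifier matters here, a minimal sketch on a made-up string (the sample markup and variable names are assumptions, not from the question) compares the same lazy pattern with and without U:

```php
<?php
// Hypothetical sample: two adjacent elements on one line.
$html = '<b>one</b> <b>two</b>';

// Without U, .*? is lazy and stops at the first closing tag.
preg_match('~<b>(.*?)</b>~', $html, $lazy);

// With U, greediness is inverted: .*? becomes greedy and runs to the last closing tag.
preg_match('~<b>(.*?)</b>~U', $html, $flipped);

var_dump($lazy[1], $flipped[1]);
```

In the first call the capture group holds only the first element's text; in the second it swallows everything up to the final closing tag.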

Update: To accept single quotes as well as double quotes, you can use the following pattern instead:

<a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a>
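As a rough sketch of how that pattern could produce the array asked for in the question: with preg_match_all and PREG_SET_ORDER, the URL lands in group 1 (double quotes) or group 2 (single quotes) and the link text in group 3. The helper name extractLinks, the ~ delimiters and the i and s modifiers are assumptions added for illustration.

```php
<?php
// Sample input taken from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// extractLinks() is a hypothetical helper name, not a PHP built-in.
function extractLinks(string $html): array
{
    // i: case-insensitive tags; s: let . match newlines inside a tag.
    $pattern = '~<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = [
            'title' => $m[3],
            // Group 1 holds a double-quoted href, group 2 a single-quoted one.
            'link'  => $m[1] !== '' ? $m[1] : $m[2],
        ];
    }
    return $links;
}

print_r(extractLinks($message));
```

This only copes with simple, well-behaved markup; nested tags inside the link or an href written without quotes are not handled.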
Marcus answered Jan 29 '23 02:01


As mentioned in the comments, don't use a regular expression for this; use a DOM parser instead. For example:

<?php
$doc = new DOMDocument;
// loadHTML() assumes ISO-8859-1 unless the markup declares a charset,
// hence the meta tag in the example data below.
$doc->loadHTML( getExampleData() );

$xpath = new DOMXPath($doc);
// Select every <a> inside the paragraph with id="abc".
foreach( $xpath->query('/html/body/p[@id="abc"]//a') as $node ) {
    echo $node->getAttribute('href'), ' - ' , $node->textContent, "\n";
}

function getExampleData() {
    return '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>...</title></head><body>
    <p>
        not <a href="wrong">this one</a> but ....
    </p>
    <p id="abc">
        Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.
    </p>
    </body></html>';
}

See http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath
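To get the array of title/link pairs requested in the question rather than echoed lines, the same DOM approach can be wrapped roughly as follows. This is a sketch under assumptions: the helper name linksFromHtml is hypothetical, the input is assumed to be a UTF-8 HTML fragment, and every <a> with an href is collected instead of filtering by paragraph id.

```php
<?php
// linksFromHtml() is a hypothetical helper name used for illustration.
function linksFromHtml(string $fragment): array
{
    // Assumption: the fragment is UTF-8. The meta tag declares that,
    // because loadHTML() otherwise assumes ISO-8859-1.
    $html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
          . $fragment . '</body></html>';

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // skip anchors that are not links
        }
        $links[] = [
            'title' => trim($a->textContent),
            'link'  => $a->getAttribute('href'),
        ];
    }
    return $links;
}

$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';
print_r(linksFromHtml($message));
```

Because textContent returns the text of all descendant nodes, links containing nested markup such as <b> or <span> still yield a plain-text title.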

VolkerK answered Jan 29 '23 00:01