Regular expression starting with http and ending with pdf?

Tags:

regex

php

preg-match

I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:

$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );

I'm pretty new to regex but from what I've learned ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?

384

asked Jun 07 '11 11:06

Weblurk

3 Answers

You need to match the characters in the middle of the URL:

/\bhttp[\w%+\/:.-]+?pdf\b/

\b matches a word boundary
^ and $ mark the beginning and end of the entire string. You don't want them here.
[...] matches any character in the brackets
\w matches any word character
+ matches one or more of the preceding element
? makes the + lazy rather than greedy
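
Since the question asks for all URLs and preg_match() stops at the first match, a minimal sketch with preg_match_all() may help. The sample HTML is made up for illustration, and the character class here includes : and . so that "://" and dotted host names can match:

```php
<?php
// Hypothetical sample input; in the question this would come from file_get_contents().
$html = '<a href="http://example.com/docs/a.pdf">A</a> '
      . '<a href="http://example.com/page.html">B</a> '
      . '<a href="https://example.org/files/b.pdf">C</a>';

// ':' and '.' are in the class so "://" and host names like example.com can match.
$pattern = '/\bhttp[\w%+\/:.-]+?pdf\b/';

// preg_match() returns only the first match; preg_match_all() collects every match.
preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full-pattern matches in document order.
$pdfUrls = $matches[0];

foreach ($pdfUrls as $url) {
    echo $url, "\n";
}
```

With this input, the .html link is skipped and the two .pdf links end up in $pdfUrls.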

answered Oct 13 '22 10:10

SLaks

preg_match( '/http[^\s]+pdf/', $html, $matches );

Matches http, followed by one or more (+) characters that are not ([^...]) whitespace (\s), followed by pdf.
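
One thing to watch with this pattern is that + is greedy. The following sketch compares it with a lazy +? on a made-up string in which two URLs are separated by a comma rather than whitespace (the input and variable names are assumptions for illustration):

```php
<?php
// Hypothetical input: two URLs with no whitespace between them.
$text = 'http://a.example/x.pdf,http://b.example/y.pdf';

// Greedy: [^\s]+ consumes as much as possible, then backs off to the LAST "pdf".
preg_match('/http[^\s]+pdf/', $text, $greedy);

// Lazy: [^\s]+? consumes as little as possible, stopping at the FIRST "pdf".
preg_match('/http[^\s]+?pdf/', $text, $lazy);

echo $greedy[0], "\n"; // both URLs, merged into one match
echo $lazy[0], "\n";   // only the first URL
```

When the URLs in the page are always separated by whitespace or sit inside quoted attributes followed by a space, the two variants may behave the same; the difference only shows up in inputs like the one above.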

answered Oct 13 '22 11:10

Billy Moon

Try this,

preg_match( '/\bhttp\S*pdf\b/', $html, $matches );

You need to match the part between the http and the pdf; this is what \S* is doing.

^ matches the start of the string and $ the end, but that is not what you want when extracting links from a longer text.

\b matches at word boundaries.

Update

For completeness: an earlier version of this answer used .*? here, but that would still match too much, so I exchanged it for \S*.

\S matches a non-whitespace character.
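
To see why the anchors get in the way, here is a small sketch comparing an anchored pattern with the \b version on a made-up URL and sentence (both strings are assumptions for illustration):

```php
<?php
// Hypothetical inputs: a bare URL, and the same URL embedded in a sentence.
$url  = 'http://example.com/report.pdf';
$text = 'See http://example.com/report.pdf for details.';

$anchored   = '/^http\S*pdf$/';   // the whole string must be the URL
$unanchored = '/\bhttp\S*pdf\b/'; // the URL may appear anywhere in the string

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$bareResult     = preg_match($anchored, $url);   // the string is exactly a URL
$embeddedResult = preg_match($anchored, $text);  // "See " precedes the URL, so ^ fails

// Without anchors, the URL is found inside the longer text.
preg_match($unanchored, $text, $m);

var_dump($bareResult, $embeddedResult, $m[0]);
```

The anchored pattern only succeeds when the subject is nothing but the URL, which is rarely the case for a full HTML page.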

answered Oct 13 '22 12:10

stema
