 

Regular expression starting with http and ending with pdf?

I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:

$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );

I'm pretty new to regex, but from what I've learned, ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?

Weblurk asked Jun 07 '11 11:06




3 Answers

You need to match the characters in the middle of the URL:

/\bhttp[\w%+\/.:-]+?pdf\b/
  • \b matches a word boundary

  • ^ and $ mark the beginning and end of the entire string. You don't want them here.

  • [...] matches any character in the brackets

  • \w matches any word character

  • + matches one or more of the preceding element

  • ? makes the + lazy rather than greedy
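As a minimal sketch of how these pieces fit together, the snippet below runs a pattern of this shape over a made-up HTML fragment. The sample markup and variable names are hypothetical, and the character class is assumed to include `.` and `:` so that `://` and dotted host names can be matched. It uses preg_match_all because the question asks for all URLs, while preg_match stops after the first match.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<a href="http://www.example.com/docs/report.pdf">Report</a> '
      . '<a href="http://www.example.com/about.html">About</a> '
      . '<a href="https://www.example.com/files/guide-2.pdf">Guide</a>';

// Word boundary, "http", one or more allowed URL characters (lazy),
// then "pdf" followed by another word boundary.
// The class lists . and : as an assumption, so "://" and host names fit.
$pattern = '/\bhttp[\w%+\/.:-]+?pdf\b/';

// preg_match_all collects every match into $matches[0].
$count = preg_match_all($pattern, $html, $matches);

print_r($matches[0]);
```

The .html link is skipped because no pdf appears before the closing quote, which is not in the character class. A URL containing characters outside the class (for example ? or =) would not match as written.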

SLaks answered Oct 13 '22 10:10


preg_match( '/http[^\s]+pdf/', $html, $matches );

This matches http, followed by one or more (+) non-whitespace characters ([^\s]), followed by pdf.
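One thing to keep in mind with this pattern is that + is greedy. The sketch below, using a made-up input string, compares it with a lazy variant (+?) when two PDF links sit next to each other with no whitespace between them. The lazy variant is an illustration, not part of the answer above.

```php
<?php
// Hypothetical input: two PDF links separated by a comma, no whitespace.
$text = 'http://example.com/a.pdf,http://example.com/b.pdf';

// Greedy: [^\s]+ runs to the end of the whitespace-free run,
// then backs up to the last "pdf" it can find.
preg_match_all('/http[^\s]+pdf/', $text, $greedy);

// Lazy: [^\s]+? stops at the first "pdf" it can reach.
preg_match_all('/http[^\s]+?pdf/', $text, $lazy);

echo count($greedy[0]), " greedy match(es)\n";
echo count($lazy[0]), " lazy match(es)\n";
```

With this input the greedy pattern returns both links glued together as a single match, while the lazy one returns them separately.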

Billy Moon answered Oct 13 '22 11:10


Try this:

preg_match( '/\bhttp\S*pdf\b/', $html, $matches );

You need to match the part between the http and the pdf; this is what \S* is doing (an earlier version of this answer used .*? for that, see the update below).

^ matches the start of the string and $ the end, which is not what you want when extracting links from a longer text.
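To see why the anchored pattern from the question finds nothing, here is a minimal sketch with made-up sample strings. With ^ and $ in place, the whole subject has to be exactly the text httppdf.

```php
<?php
// The pattern from the question: "http" immediately followed by "pdf",
// anchored to the start and end of the subject string.
$anchored = '/^http(pdf)$/';

// Hypothetical subjects for illustration.
$literal = 'httppdf';
$url     = 'http://example.com/a.pdf';
$page    = '<a href="http://example.com/a.pdf">A</a>';

// preg_match returns 1 on a match and 0 on no match.
$onLiteral = preg_match($anchored, $literal);
$onUrl     = preg_match($anchored, $url);
$onPage    = preg_match($anchored, $page);

var_dump($onLiteral, $onUrl, $onPage);
```

Only the first subject matches; a real URL fails because nothing in the pattern allows characters between http and pdf, and a full page fails additionally because the URL is not the entire string.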

\b matches at word boundaries.

Update

For completeness: .*? would still match too much, so I exchanged it for \S*.

\S matches a non-whitespace character.
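The difference between .*? and \S* can be sketched as follows. The sample sentence is made up for illustration, and preg_match_all is used here so that every match is visible.

```php
<?php
// Hypothetical text where the words "http" and "pdf" also appear in prose.
$text = 'The http spec is described in this pdf and at http://example.com/spec.pdf today';

// .*? may cross spaces, so it can pair a stray "http" with a later "pdf".
preg_match_all('/\bhttp.*?pdf\b/', $text, $dot);

// \S* cannot cross whitespace, so only a contiguous URL-like token matches.
preg_match_all('/\bhttp\S*pdf\b/', $text, $nonSpace);

print_r($dot[0]);
print_r($nonSpace[0]);
```

Here the .*? version also returns the phrase starting at the first "http" and ending at the standalone word "pdf", while the \S* version returns only the URL.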

stema answered Oct 13 '22 12:10