I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:
$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );
I'm pretty new to regex, but from what I've learned ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?
You need to match the characters in the middle of the URL:
/\bhttp[\w%+\/.:-]+?pdf\b/
\b matches a word boundary.
^ and $ mark the beginning and end of the entire string. You don't want them here.
[...] matches any character in the brackets.
\w matches any word character.
+ matches one or more of the previous match.
? makes the + lazy rather than greedy.
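As a minimal sketch of the points above, the following compares the anchored pattern from the question with an unanchored one. It uses preg_match_all so that every link is collected rather than only the first. The sample $html string and its URLs are made up for illustration, and the character class here includes : and . as an assumption, so that ordinary http:// URLs can match.

```php
<?php
// Hypothetical sample input; in the question this comes from file_get_contents().
$html = '<a href="http://www.example.com/docs/a.pdf">A</a> '
      . '<a href="http://www.example.com/index.html">Home</a> '
      . '<a href="https://www.example.com/b.pdf">B</a>';

// Anchored pattern from the question: it can only match the exact string "httppdf".
$anchored = preg_match('/^http(pdf)$/', $html);

// Unanchored pattern: http, then lazily one or more URL characters, then pdf.
preg_match_all('/\bhttp[\w%+\/.:-]+?pdf\b/', $html, $matches);
$urls = $matches[0]; // index 0 holds the full matches

var_dump($anchored);
print_r($urls);
```

With this sample input, the anchored pattern finds nothing, while the unanchored one picks up the two links ending in pdf and skips the .html link.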
preg_match( '/http[^\s]+pdf/', $html, $matches );
Matches http, followed by one or more (+) characters that are not ([^...]) whitespace (\s), followed by pdf.
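One thing to watch for with this pattern: [^\s]+ is greedy, so inside HTML it can run past the closing quote of the attribute if more non-whitespace text ending in pdf follows. The sketch below illustrates this on a made-up snippet and shows one possible tightening (also excluding quotes and angle brackets, and making the + lazy). The stricter character class is an assumption for illustration, not part of the original answer.

```php
<?php
// Hypothetical markup where the link text also ends in "pdf" and no space follows the URL.
$html = '<a href="http://www.example.com/a.pdf">a.pdf</a>';

// Pattern from the answer: greedy [^\s]+ runs to the last "pdf" before any whitespace.
preg_match('/http[^\s]+pdf/', $html, $greedy);

// Assumed variant: exclude quotes and angle brackets too, and make + lazy.
preg_match('/http[^\s"\'<>]+?pdf/', $html, $strict);

echo $greedy[0], PHP_EOL;
echo $strict[0], PHP_EOL;
```

On this input the greedy pattern returns the URL plus the trailing `">a.pdf`, while the stricter variant stops at the end of the URL.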
Try this,
preg_match( '/\bhttp\S*pdf\b/', $html, $matches );
You need to match the part between the http and the pdf; this is what \S* (originally .*?, see the update) is doing.
^ matches the start of the string and $ the end, but this is not what you want when you are extracting those links from a longer text.
\b matches on word boundaries.
Update
For completeness: .*? would still match too much, so I exchanged it with \S*.
\S matches a non-whitespace character.
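To make the difference in the update concrete, here is a small sketch that runs both variants over the same text with preg_match_all. The $text string is invented for illustration; it contains a non-PDF link followed later by the plain word "pdf".

```php
<?php
// Hypothetical text: a non-PDF link, the word "pdf", then an actual PDF link.
$text = 'See http://www.example.com/page and download the pdf from http://www.example.com/b.pdf today';

// .*? may cross spaces, so the first match runs from the first link to the word "pdf".
preg_match_all('/\bhttp.*?pdf\b/', $text, $dotStar);

// \S* cannot cross whitespace, so only a single unbroken token ending in pdf matches.
preg_match_all('/\bhttp\S*pdf\b/', $text, $nonSpace);

print_r($dotStar[0]);
print_r($nonSpace[0]);
```

Here the .*? version returns two matches, the first of which spans several words, while the \S* version returns only the real PDF link.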