I tried to match JS and PHP URLs with Python's re module, but the expression below doesn't work. Can anyone help me?
import re, urllib2
response = urllib2.urlopen('https://www.cnn.com')
s = response.read()
p = re.compile(r'^(http|https|//).+?\.(js|php)$')
m = p.findall(s)
for i in m:
    print i
Also, some Web pages use //, not http or https. Is there any way to match those, too?
You seem to want to match URLs that end with the file extension js or php and that may start with http, https or //.
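As background, here is a minimal sketch of why the original pattern comes back empty on a real page. The `page` string is a made-up two-line stand-in for downloaded HTML, not actual CNN markup, and the variable names are only illustrative.

```python
import re

# Hypothetical two-line snippet standing in for real page content.
page = 'src="https://www.cnn.com/1.js"\nhref="//some.site.com/2.php"'

# The pattern from the question, unchanged.
original = re.compile(r'^(http|https|//).+?\.(js|php)$')

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so a URL embedded in surrounding markup is never found.
no_hits = original.findall(page)

# Even when the whole string is exactly one URL, findall returns the
# capture groups rather than the full URL.
groups_only = original.findall('https://www.cnn.com/1.js')

print(no_hits)
print(groups_only)
```

The first call prints an empty list, and the second prints `[('http', 'js')]`: the two capturing groups replace the full match in the result. Dropping the anchors and switching to non-capturing groups addresses both problems.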
Use
import re
s = "https://www.cnn.com/1.js!! http://www.cnn.com/2.php; //some.site.com/3.js,"
res = re.findall(r'(?:\bhttps?:)?//\S*\.(?:js|php)\b', s)
print(res)
Details:
(?:\bhttps?:)? - an optional sequence of:
    \b - a leading word boundary
    https?: - http, then 1 or 0 (= optional) s, and a :
// - a literal char sequence //
\S* - zero or more non-whitespace symbols
\. - a dot
(?:js|php) - js or php literal char sequences
\b - a trailing word boundary
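The breakdown above can also be written into the pattern itself with `re.VERBOSE`, which ignores whitespace inside the pattern and allows inline comments. This is only a sketch of the same expression; the names `URL_RE`, `s`, and `urls` are illustrative.

```python
import re

# Same pattern as in the answer, spread out with one comment per piece.
URL_RE = re.compile(r"""
    (?:\bhttps?:)?   # optional scheme: word boundary, http or https, then ':'
    //               # literal '//'
    \S*              # zero or more non-whitespace characters
    \.               # a literal dot
    (?:js|php)       # 'js' or 'php'
    \b               # trailing word boundary
""", re.VERBOSE)

s = "https://www.cnn.com/1.js!! http://www.cnn.com/2.php; //some.site.com/3.js,"
urls = URL_RE.findall(s)
print(urls)
```

Because every group is non-capturing, `findall` returns the full matched URLs, including the one that starts with `//`.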