I am working on a web crawler and am trying to write a regex that supports the following.
Match: all pages starting with
http://intranet/
But not starting with
http://intranet/sites/ and http://intranet/search/
And located in the subfolder /Pages/, ending with .aspx
Valid sample:
http://intranet/products/Pages/default.aspx
Invalid samples:
http://intranet/Pages/sofus/default.aspx
http://intranet/sites/products/Pages/default.aspx
http://intranet/products/Pages/default.aspx#
So far I have come up with this:
^http://intranet.*/Pages/.*.aspx+
Any help appreciated.
A pattern like this should work:
^http://intranet/(?!sites|search)[^/]+/Pages/.*\.aspx$
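The (?!...) creates what's known as a negative lookahead assertion and ensures that the path segment matched by [^/]+ does not start with sites or search. The \. matches a literal dot, and the $ anchors the match to the end of the URL, so a trailing # is rejected.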
Here's a demonstration.
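To check the pattern against the samples from the question, here is a minimal sketch in Python. The question does not say which language the crawler uses, so Python is an assumption, and the helper name should_crawl is made up for illustration.

```python
import re

# Pattern from the answer, compiled once so it can be reused per URL.
PATTERN = re.compile(r"^http://intranet/(?!sites|search)[^/]+/Pages/.*\.aspx$")


def should_crawl(url):
    """Return True if the URL matches the pattern (hypothetical helper)."""
    return PATTERN.match(url) is not None


# Sample URLs taken from the question: one valid, three invalid.
samples = [
    "http://intranet/products/Pages/default.aspx",
    "http://intranet/Pages/sofus/default.aspx",
    "http://intranet/sites/products/Pages/default.aspx",
    "http://intranet/products/Pages/default.aspx#",
]

for url in samples:
    print(should_crawl(url), url)
```

Only the first sample prints True. Two things to keep in mind: the lookahead as written also rejects any first folder that merely begins with sites or search (for example searching), and [^/]+ allows exactly one folder between the host and /Pages/. If only the exact folder names should be excluded, (?!(?:sites|search)/) is one possible variant.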