
Why is this Python regex negative look ahead not working?

I am trying to collect a set of URLs, using BeautifulSoup, with very specific criteria. The URLs I want to collect must contain /b-\d+ (/b- followed by a series of digits). However, I want to ignore all URLs containing View%20All, even if they also contain /b-\d+. Here is a sample of URLs:

1. http://www.foo.com/bar/b-12312903?sName=View%20All
2. http://www.foo.com/bar/b-832173712873?sName=View%20All
3. http://www.foo.com/bar/b-1208313109283129
4. http://www.foo.com/bar/b-2198123371239489?adCell=W3

Given the above sample, the valid URLs that I want to collect are #3 and #4. I have tried using different negative lookahead regular expressions and they just aren't working for me:

{"href" : re.compile(r"\/b-\d+.+(?!View\%20All)")}
{"href" : re.compile(r"^.+\/b-\d+.+(?!View\%20All$)")}

Can someone tell me what I am doing wrong?

lollerskates asked Sep 03 '25 15:09


1 Answer

{"href" : re.compile(r"\/b-\d+.+(?!View\%20All)")}
{"href" : re.compile(r"^.+\/b-\d+.+(?!View\%20All$)")}

Where did it go wrong?

When you write (?!View\%20All), it asserts that View\%20All cannot be matched immediately after the preceding pattern, which here is .+

In effect, the lookahead can always succeed, because .+ is free to stop at a position that is not followed by View%20All.

To illustrate, let's check what each part of the pattern matches in URL 1:

http://www.foo.com/bar/b-12312903?sName=View%20All

/b- matches literally

\d+ matches 12312903

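Now the problem arises:

.+ matches whatever it needs to so that the negative assertion (?!View\%20All) succeeds.

For example, if .+ matches only ?, the string left unmatched is sName=View%20All. That does not start with View%20All, so (?!View\%20All) succeeds at that position. In practice .+ is greedy and runs to the end of the string, where nothing follows at all, so the assertion succeeds there too. Either way the pattern always matches, including URLs 1 and 2.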
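A quick way to see this is to run the first pattern from the question over the four sample URLs and print what it matched. This is only a sketch using the standard re module; the variable names are made up for illustration.

```python
import re

# The four sample URLs from the question.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# First pattern from the question, unchanged.
original = re.compile(r"\/b-\d+.+(?!View\%20All)")

# For each URL, record the text the pattern matched (None if no match).
matched = [m.group(0) if m else None for m in map(original.search, urls)]

for url, text in zip(urls, matched):
    print(url, "->", text)
```

All four URLs produce a match. For URLs 1 and 2 the matched text itself ends in View%20All, because .+ has already consumed it by the time the lookahead is checked.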

See the demo for a clearer picture.

How to fix it?

When using lookaround assertions, pin down the position at which the check starts.

For example, use a regex like

(\/b-\d+)(\?|$)(?!sName=View\%20All)

which matches URLs 3 and 4, as shown here:

http://regex101.com/r/aS5yS2/1

Here the ? or the end of the string ($) fixes the position at which the negative assertion is checked.
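To check this locally, here is a minimal sketch that filters the sample URLs with the pattern above. The second pattern, the extra URL, and all variable names are assumptions added for illustration and are not part of the original answer. The variant is for the case where View%20All may appear later in the query string rather than right after the ?.

```python
import re

# The four sample URLs from the question.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# Pattern from the answer: the lookahead is checked right after "?" or at end of string.
fixed = re.compile(r"(\/b-\d+)(\?|$)(?!sName=View\%20All)")
kept = [u for u in urls if fixed.search(u)]

# Hypothetical variant (an assumption, not from the answer): reject View%20All
# anywhere after the id. (?!\d) stops \d+ from backtracking to a shorter
# number just to get past the second lookahead.
variant = re.compile(r"/b-\d+(?!\d)(?!.*View%20All)")
kept_variant = [u for u in urls if variant.search(u)]

# Hypothetical URL where View%20All is not the first query parameter.
extra = "http://www.foo.com/bar/b-77?adCell=W3&sName=View%20All"

print(kept)
print(bool(fixed.search(extra)), bool(variant.search(extra)))
```

Both patterns keep only URLs 3 and 4 from the sample. They differ on the extra URL: the answer's pattern accepts it because sName=View%20All does not come directly after the ?, while the variant rejects it. Either compiled pattern can be used in the same {"href" : ...} filter shown in the question.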

nu11p01n73R answered Sep 05 '25 11:09