I have searched a lot of topics but can't seem to find the answer to my specific question. I created a crawl spider for a website and it works perfectly. I then made a similar one to crawl a similar website, but this time I have a small issue. Down to business:
My start URL looks like www.example.com. The links on that page that I want my spider to follow look like:
...
I now have an issue: every time I request the start URL, it automatically redirects to www.example.com/locationA, and that redirected page is never among the URLs my spider works on. So my problem is how to include www.example.com/locationA in the returned URLs. I even got log info like:
-2011-11-28 21:25:33+1300 [example.com] DEBUG: Redirecting (302) to from http://www.example.com/>
-2011-11-28 21:25:34+1300 [example.com] DEBUG: Redirecting (302) to (referer: None)
Print out from parse_item: www.example.com/locationB
....
I think the issue might be related to that (referer: None) somehow. Could anyone please shed some light on this?
I have narrowed down the issue by changing the start URL to www.example.com/locationB. Since all the pages contain lists of all locations, this time my spider works on:
-www.example.com/locationA
-www.example.com/locationC ...
In a nutshell, I am looking for a way to include the URL that is the same as (or redirected from) the start URL in the list that the parse_item callback works on.
For others with the same problem: after a lot of searching, all you need to do is name your callback function parse_start_url.
E.g.:
rules = (
    Rule(
        LinkExtractor(
            allow=(),
            restrict_xpaths=(
                '//*[contains(concat(" ", @class, " "), " pagination-next ")]//a',
            ),
        ),
        callback="parse_start_url",
        follow=True,
    ),
)
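Why renaming the callback works: CrawlSpider reserves parse() for its own routing, and every response to a start URL (including one reached via a 302 redirect) is handed to a method named parse_start_url, which is a no-op by default. Pointing the rule's callback at that same name means one method receives both the (redirected) start page and the rule-matched pages. A simplified, stdlib-only sketch of that dispatch (MiniCrawlSpider and MySpider are illustrative stand-ins, not real Scrapy classes):

```python
class MiniCrawlSpider:
    """Toy model of scrapy.spiders.CrawlSpider's start-URL handling."""

    def parse(self, response):
        # CrawlSpider keeps parse() for itself: responses to start URLs
        # are passed to parse_start_url before rule extraction happens.
        yield from self.parse_start_url(response)

    def parse_start_url(self, response):
        # Default is a no-op, so the start (or redirected) page is
        # silently dropped unless this method is overridden.
        return []


class MySpider(MiniCrawlSpider):
    def parse_start_url(self, response):
        # Overriding it (or naming a Rule's callback "parse_start_url")
        # means the redirected start page is processed too.
        yield {"url": response}


spider = MySpider()
items = list(spider.parse("http://www.example.com/locationA"))
print(items)  # the redirected start URL is now captured
```

In the real spider, the same effect comes from the rule above: because the callback is literally named parse_start_url, Scrapy invokes it for the start-URL response as well as for every page the LinkExtractor matches.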