 

How to include the start url in the "allow" rule in SgmlLinkExtractor using a scrapy crawl spider

I have searched a lot of topics but cannot seem to find the answer to my specific question. I created a crawl spider for a website and it works perfectly. I then made a similar one to crawl a similar website, but this time I have a small issue. Down to business:

My start URL looks as follows: www.example.com. The links on that page that I want my spider to follow look like:

  • www.example.com/locationA
  • www.example.com/locationB
  • www.example.com/locationC

...

Now the issue: every time I request the start URL, it automatically redirects to www.example.com/locationA, and the links my spider ends up working on include only

  • www.example.com/locationB
  • www.example.com/locationC ...

So my problem is how to include www.example.com/locationA in the returned URLs. The log shows entries like:

  • 2011-11-28 21:25:33+1300 [example.com] DEBUG: Redirecting (302) to from http://www.example.com/
  • 2011-11-28 21:25:34+1300 [example.com] DEBUG: Redirecting (302) to (referer: None)
  • 2011-11-28 21:25:37+1300 [example.com] DEBUG: Redirecting (302) to (referer: www.example.com/locationB)

Print out from parse_item: www.example.com/locationB

....

I think the issue might be related to that (referer: None) somehow. Could anyone please shed some light on this?

I have narrowed down the issue by changing the start URL to www.example.com/locationB. Since every page contains the list of all locations, this time my spider works on:

  • www.example.com/locationA
  • www.example.com/locationC ...

In a nutshell, I am looking for a way to include the URL that is the same as (or redirected from) the start URL in the list that the parse_item callback works on.

asked Nov 28 '11 by user1068961

1 Answer

For others who have the same problem: after a lot of searching, all you need to do is name your callback function parse_start_url.

Eg:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=(
        '//*[contains(concat( " ", @class, " " ), concat( " ", "pagination-next", " " ))]//a',)),
        callback="parse_start_url", follow=True),
)
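The trick works because CrawlSpider hands the response for the start URL (including the page a 302 redirect lands on) to its parse_start_url method, whose default implementation yields nothing, while only link-extracted pages reach the Rule callbacks. Naming the Rule callback parse_start_url makes one method handle both cases. Below is a minimal, dependency-free sketch of that dispatch; ToySpider and crawl are illustrative stand-ins, not Scrapy's real API:

```python
# Toy sketch of CrawlSpider's callback dispatch (no Scrapy needed).
# Only the name parse_start_url comes from Scrapy; everything else
# here is an illustrative assumption.
class ToySpider:
    def parse_start_url(self, response):
        # Handles BOTH the start response and rule-matched pages,
        # because the Rule's callback is also named "parse_start_url".
        return {"url": response}

    def crawl(self, start_url, extracted_links):
        items = []
        # The start response (after any 302 redirect) is routed to
        # parse_start_url, never to a Rule callback:
        items.append(self.parse_start_url(start_url))
        # Pages matched by a Rule go to the Rule's callback, which
        # here is the very same method, so nothing is dropped:
        for link in extracted_links:
            items.append(self.parse_start_url(link))
        return items

spider = ToySpider()
items = spider.crawl("www.example.com/locationA",
                     ["www.example.com/locationB",
                      "www.example.com/locationC"])
print([item["url"] for item in items])
```

With the default (empty) parse_start_url, the first append would effectively contribute nothing, which is exactly why locationA was missing from the results.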
answered Jan 04 '23 by mindcast