The website that I am crawling contains many players, and when I click on any player I can go to his page.
The website structure is like this:
<main page>
<link to player 1>
<link to player 2>
<link to player 3>
..
..
..
<link to player n>
</main page>
And when I click on any link, I go to the player's page, which looks like this:
<player name>
<player team>
<player age>
<player salary>
<player date>
I want to scrape all the players whose age is between 20 and 25 years. My plan is:

1. scrape the main page using the first spider,
2. get the links using the first spider,
3. crawl each link using the second spider,
4. get the player information using the second spider,
5. save this information to a JSON file using a pipeline (a sketch follows below).
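For step 5, here is a minimal sketch of such a pipeline, following the standard Scrapy JSON-writer pattern (players.json is a placeholder filename, and the class still has to be enabled via ITEM_PIPELINES in settings.py):

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('players.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # serialize each scraped player as one JSON line
        self.file.write(json.dumps(dict(item)) + "\n")
        return item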
How can I return the date value from the second spider to the first spider?
I built my own middleware and overrode process_spider_output. It allows me to print the request, but I don't know what else I should do in order to return that date value to my first spider.
Any help is appreciated.
Here is some of the code:
def parse(self, response):
    sel = Selector(response)
    container = sel.css('div[MyDiv]')  # placeholder selector for the list of players
    for player in container:
        # extract LINK and TITLE from the current entry
        yield Request(LINK, meta={'Title': TITLE}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE into the item
    return player
All you need to do is check the date in parsePlayer, and return only the relevant items:
def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE into the item
    if DATE == some_criteria:
        yield player
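With this approach the filtering happens entirely in the second callback, so nothing ever has to travel back to parse, and no middleware is needed.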
You would only need something more elaborate if, for example, you have performance issues (you are scraping far too many links and don't need the ones past some limit).
Given that Scrapy handles requests asynchronously, there is no really good way to do that. The only option you have is to force linear behavior instead of the default parallel requests.
Let me explain. With two callbacks like that, Scrapy's default behavior is to parse the first page (the main page) and put all the requests for the player pages in its queue. Without waiting for that first page to finish being scraped, it starts processing those player-page requests (not necessarily in the order it found them).
Therefore, by the time you learn that player page p fails your date check, Scrapy has already queued internal requests for p+1, p+2, ..., p+m (m is essentially arbitrary) AND has probably started processing some of them, possibly even p+1 before p (no fixed order, remember).
So there is no way to stop exactly at the right page if you keep this pattern, and no way to interact with parse from parsePlayer.
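As a partial mitigation (an aside, not part of the pattern below), you can shrink the pool of in-flight requests through Scrapy's settings. This reduces the overshoot, but by itself it still does not let parsePlayer talk back to parse:

# settings.py
# Handle one request at a time. This narrows the window of already-scheduled
# requests, but it does not stop the crawl at the right page on its own.
CONCURRENT_REQUESTS = 1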
What you can do is force it to follow the links in order, so that you have full control. The drawback is that this takes a big toll on performance: if Scrapy follows each link one after the other, it can't process them simultaneously as it usually does, and everything slows down.
The code could be something like:
def parse(self, response):
    sel = Selector(response)
    self.container = sel.css('div[MyDiv]')  # keep all player entries around
    return self.increment(0)

# Generator that yields the request for player number `index`
def increment(self, index):
    player = self.container[index]  # select the current player
    # extract LINK and TITLE from `player`
    yield Request(LINK, meta={'Title': TITLE, 'index': index}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE into the item
    yield player
    if DATE == some_criteria:
        index = response.meta['index'] + 1
        if index < len(self.container):
            # the requests from increment() must be yielded; without this,
            # the follow-up request is silently dropped
            for request in self.increment(index):
                yield request
That way Scrapy will fetch the main page once, then request the first player, then the second, and so on, one at a time, until it finds a date that doesn't fit the criteria. At that point parsePlayer yields no follow-up request and the spider stops.
This gets a little more complex if you also have to increment an index for the main pages (if there are n main pages, for example), but the idea stays the same.
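A side note: if it is acceptable that a few already-scheduled requests still get processed, Scrapy's built-in CloseSpider exception is a simpler way to stop the whole crawl once the criterion fails. A sketch, keeping the placeholder names from the pseudocode above:

from scrapy.exceptions import CloseSpider

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE into the item (placeholder, as above)
    if DATE != some_criteria:
        # stops the spider gracefully; requests already in flight may still run
        raise CloseSpider('date criterion no longer met')
    yield player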