Using Scrapy: getting CrawlSpider to work with an authenticated (logged-in) user session

How can I get my CrawlSpider to work? I am able to log in, but after that nothing gets scraped. I have also been reading the Scrapy docs and I really don't understand the rules I should use for scraping. Why does nothing happen after "Successfully logged in. Let's start crawling!"?

I also had the rule below at the end of my else block, but removed it because it was never being reached there. I then moved it to the top of the start_requests() method, but that raised errors, so I removed the rules entirely.

    rules = (
        Rule(extractor, callback='parse_item', follow=True),
    )

My code:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedconv.items import LinkedconvItem


class LinkedPySpider(CrawlSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    # start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]
    start_urls = ["http://www.linkedin.com/csearch/results"]

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    # def init_request(self):
    #     """This function is called before crawling starts."""
    #     return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'session_key': '[email protected]', 'session_password': 'mypassword'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('Hi, this is an item page! %s' % response.url)
            return
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedconvItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

My output:

C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-12 13:39:40-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-12 13:39:40-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-12 13:39:41-0500 [LinkedPy] INFO: Spider opened
2013-07-12 13:39:41-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-12 13:39:41-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-12 13:39:42-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-12 13:39:45-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-12 13:39:45-0500 [LinkedPy] DEBUG:


    Successfully logged in. Let's start crawling!



2013-07-12 13:39:45-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.linkedin.com/nhome/
2013-07-12 13:39:45-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-12 13:39:45-0500 [LinkedPy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1670,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 2,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 65218,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 2,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 12, 18, 39, 45, 136000),
     'log_count/DEBUG': 11,
     'log_count/INFO': 4,
     'request_depth_max': 1,
     'response_received_count': 2,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2013, 7, 12, 18, 39, 41, 50000)}
2013-07-12 13:39:45-0500 [LinkedPy] INFO: Spider closed (finished)
1 Answer

Right now, the crawling ends in check_login_response() because Scrapy has not been told to do anything more.

  • 1st request to the login page using start_requests(): OK
  • 2nd request to POST the login information: OK
  • whose response is parsed by check_login_response()... and that's it

Indeed, check_login_response() returns nothing. To keep the crawl going, you need to return Request instances that tell Scrapy which pages to fetch next (see the Scrapy documentation on spiders' callbacks).

So, inside check_login_response(), you need to return a Request for a starting page that contains the links you want to crawl next, probably one of the URLs you defined in start_urls.

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return Request(url='http://linkedin.com/page/containing/links')

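For instance, here is a minimal sketch of that idea, assuming you simply want to feed your existing start_urls back into the crawl once the login succeeds (log messages shortened):

    def check_login_response(self, response):
        """After a successful login, hand the start URLs back to the spider."""
        if "Sign Out" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # No callback is given here, so Scrapy falls back to the spider's
            # parse() method for these responses.
            return [Request(url=url, dont_filter=True) for url in self.start_urls]
        else:
            self.log("Login failed, nothing will be crawled.")
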
By default, if you do not set a callback for your Request, the spider calls its parse() method (http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.BaseSpider.parse).

In your case, CrawlSpider's built-in parse() method is called for you automatically, and it applies the Rules you have defined to fetch the next pages.
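
In other words, a small illustration (the URL is just a placeholder):

    # Inside any spider callback:
    # - with an explicit callback, Scrapy passes the response to that method;
    # - without one, Scrapy uses self.parse(), which on a CrawlSpider is the
    #   built-in method that applies the class-level rules to the response.
    yield Request(url='http://www.linkedin.com/csearch/results', callback=self.parse_item)
    yield Request(url='http://www.linkedin.com/csearch/results')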

You must define your CrawlSpider rules within a rules attribute of your spider class, just as you did for name, allowed_domains etc., at the same level.

http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example provides example Rules. The main idea is that you tell the extractor what kind of absolute URL you are interested in within the page, using regular expression(s) in allow. If you do not set allow in your SgmlLinkExtractor, it will match all links.

And each Rule should have a callback to use for those links, in your case parse_item() (see the sketch below).
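
Putting that together, here is a rough sketch of what the class-level attributes could look like. The allow pattern is only a guess at the company-search result URLs; adapt the regular expression to the links you actually want to follow:

    class LinkedPySpider(CrawlSpider):
        name = 'LinkedPy'
        allowed_domains = ['linkedin.com']
        start_urls = ["http://www.linkedin.com/csearch/results"]

        # rules lives at class level, next to name and allowed_domains.
        # An empty SgmlLinkExtractor() would match every link on the page;
        # the allow regex below narrows it to search-result-like URLs.
        rules = (
            Rule(SgmlLinkExtractor(allow=(r'/csearch/results',)),
                 callback='parse_item',
                 follow=True),
        )

With follow=True, links are also extracted from the pages matched by this rule, so paginated result pages can be reached as well.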

Good luck with parsing LinkedIn pages; I suspect a lot of their content is generated via JavaScript and may not be inside the HTML content fetched by Scrapy.
