Using loginform with scrapy

The Scrapy project (https://github.com/scrapy/scrapy) provides a companion library, loginform (https://github.com/scrapy/loginform), for logging into websites that require authentication.
I have looked through the docs for both, but I cannot figure out how to get Scrapy to call loginform before it starts crawling. The login works fine with loginform on its own.
Thanks

asked Apr 22 '15 by ollierexx


1 Answer

loginform is just a library, totally decoupled from Scrapy.
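For example, used completely outside Scrapy it only needs something to fetch and submit the form. A minimal sketch (the URL and credentials are placeholders, and the requests library here is just one way to do the HTTP part):

import requests
from loginform import fill_login_form

login_url = 'http://somewebsite.com/login-page'  # placeholder

# fetch the login page
response = requests.get(login_url)

# fill_login_form locates the login form in the page, fills in the
# credentials, and returns the form data, the submit URL and the method
data, url, method = fill_login_form(login_url, response.text,
                                    'your-username', 'secret-password-here')

# submit the filled-in form, keeping the session cookie around
session = requests.Session()
session.request(method, url, data=dict(data))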

You have to write the code to plug it into the spider you want, probably in a callback method.

Here is an example of how you could structure it:

import scrapy
from loginform import fill_login_form

class MySpiderWithLogin(scrapy.Spider):
    name = 'my-spider'

    start_urls = [
        'http://somewebsite.com/some-login-protected-page',
        'http://somewebsite.com/another-protected-page',
    ]

    login_url = 'http://somewebsite.com/login-page'

    login_user = 'your-username'
    login_password = 'secret-password-here'

    def start_requests(self):
        # let's start by sending a first request to the login page
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        # got the login page, let's fill the login form...
        data, url, method = fill_login_form(response.url, response.body,
                                            self.login_user, self.login_password)

        # ... and send a request with our login data
        return scrapy.FormRequest(url, formdata=dict(data),
                                  method=method, callback=self.start_crawl)

    def start_crawl(self, response):
        # OK, we're in, let's start crawling the protected pages
        for url in self.start_urls:
            yield scrapy.Request(url)

    def parse(self, response):
        # do stuff with the logged-in response
        pass
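
Two notes on why this works: Scrapy keeps cookie handling enabled by default, so the session cookie set by the login response is sent automatically on the follow-up requests; and because start_requests is overridden, the URLs in start_urls are only requested after the login has gone through. For a quick test you can run a self-contained spider like this without a full project, e.g. scrapy runspider my_spider.py (the file name here is just whatever you saved it as).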
answered Sep 30 '22 by Elias Dorneles