The Scrapy framework (https://github.com/scrapy/scrapy) has a companion library, loginform (https://github.com/scrapy/loginform), for logging into websites that require authentication.
I have looked through the docs for both projects, but I cannot figure out how to get Scrapy to call loginform before crawling. The login works fine with loginform on its own.
Thanks
loginform is just a library, totally decoupled from Scrapy. You have to write the code to plug it into the spider you want, probably in a callback method.
Here is an example of a structure to do this:
import scrapy
from loginform import fill_login_form


class MySpiderWithLogin(scrapy.Spider):
    name = 'my-spider'

    start_urls = [
        'http://somewebsite.com/some-login-protected-page',
        'http://somewebsite.com/another-protected-page',
    ]

    login_url = 'http://somewebsite.com/login-page'
    login_user = 'your-username'
    login_password = 'secret-password-here'

    def start_requests(self):
        # let's start by sending a first request to the login page
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        # got the login page, let's fill the login form...
        data, url, method = fill_login_form(response.url, response.body,
                                            self.login_user, self.login_password)
        # ...and send a request with our login data
        return scrapy.FormRequest(url, formdata=dict(data),
                                  method=method, callback=self.start_crawl)

    def start_crawl(self, response):
        # OK, we're in, let's start crawling the protected pages
        for url in self.start_urls:
            yield scrapy.Request(url)

    def parse(self, response):
        # do stuff with the logged-in response
        pass
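For context, fill_login_form parses the login form out of the page and hands back the submission data, the action URL, and the HTTP method, which the spider above unpacks as data, url, method. A rough stdlib-only sketch of the same idea (the helper name fill_form, the sample HTML, and the username/password field names are all made up for illustration, not loginform's actual implementation):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFinder(HTMLParser):
    """Collect the first form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = 'GET'
        self.fields = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self._in_form = True
            self.action = attrs.get('action', '')
            self.method = attrs.get('method', 'GET').upper()
        elif tag == 'input' and self._in_form and 'name' in attrs:
            # keep pre-filled values, e.g. hidden CSRF tokens
            self.fields[attrs['name']] = attrs.get('value', '')

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False


def fill_form(page_url, html, username, password,
              user_field='username', pass_field='password'):
    # parse the form, then overwrite the credential fields
    finder = FormFinder()
    finder.feed(html)
    data = dict(finder.fields)
    data[user_field] = username
    data[pass_field] = password
    return data, urljoin(page_url, finder.action), finder.method


page = '''<form action="/do-login" method="post">
<input type="hidden" name="csrf" value="abc123">
<input type="text" name="username">
<input type="password" name="password">
</form>'''
data, url, method = fill_form('http://somewebsite.com/login-page',
                              page, 'me', 's3cret')
# data now holds the csrf token plus the filled-in credentials,
# url is the absolute form action, and method is 'POST'
```

The real library is smarter about picking the right form and field names, which is exactly why the spider delegates to it instead of hand-parsing the page.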