
How can I use scrapy shell with a URL and basic auth credentials?

I want to use scrapy shell to test response data for a URL that requires basic auth credentials. I checked the scrapy shell documentation but couldn't find anything about it there.

I tried scrapy shell 'http://user:[email protected]' but it didn't work. Does anybody know how I can achieve this?

asked Mar 16 '17 by Rohanil

People also ask

What is Scrapy shell and how to use it?

Scrapy comes with an interactive shell that lets you run simple commands, scrape data without writing spider code, and test your XPath or CSS expressions against a live response.
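For example, a quick session might look like this (the URL and selector here are only illustrations):

$ scrapy shell 'http://example.com'
>>> response.status
200
>>> response.css('title::text').get()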

How to do login procedures in Scrapy?

For the simplest login procedures in Scrapy you can use Scrapy's FormRequest class. In practice it is usually better to use one of FormRequest's helper methods, such as from_response, to populate the form data. To use it in a spider, import it first.
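Here is a minimal sketch of a login spider using FormRequest.from_response; the URL, form field names, and the success check are assumptions, not taken from a real site:

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://example.com/login']  # assumed login page

    def parse(self, response):
        # Fill and submit the login form found in the response;
        # the field names are assumptions
        return FormRequest.from_response(
            response,
            formdata={'username': 'your_user', 'password': 'your_password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Verify the login worked before scraping anything else
        if b'Logout' in response.body:
            self.logger.info('Login succeeded')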

How does a spider authenticate before scraping?

A common pattern is a spider that authenticates before scraping. In this case, the login is handled in the parse method (the default callback of any request): whenever a request is made, the response is checked for the presence of the login form, as sketched below.
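Building on the spider sketched above, the default parse callback can branch on whether the login form is actually present (the CSS selector is an assumption):

    def parse(self, response):
        # Log in only if the page still shows a login form
        if response.css('form#login'):  # assumed selector
            return FormRequest.from_response(
                response,
                formdata={'username': 'your_user', 'password': 'your_password'},
                callback=self.after_login,
            )
        # Already authenticated, continue scraping directly
        return self.after_login(response)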

How do I use relative file paths in Scrapy?

When using relative file paths, be explicit and prepend them with ./ (or ../ when relevant). scrapy shell index.html will not work as one might expect (and this is by design, not a bug).
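For example, to open a local file in the shell (the file name is just an illustration):

$ scrapy shell ./index.html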


2 Answers

If you want to use only the shell, you can do something like this:

$ scrapy shell

and inside the shell:

>>> from w3lib.http import basic_auth_header
>>> from scrapy import Request
>>> # build the Authorization header from your credentials
>>> auth = basic_auth_header('your_user', 'your_password')
>>> req = Request(url="http://example.com", headers={'Authorization': auth})
>>> fetch(req)

since fetch() uses the given request to update the shell session (the request, response, etc. objects).
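Once fetch(req) completes you can inspect the authenticated response directly; a 200 status here assumes the credentials were accepted:

>>> response.status
200
>>> response.css('title::text').get()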

answered Oct 18 '22 by eLRuLL


Yes, with the HTTP auth downloader middleware.

Make sure HttpAuthMiddleware is enabled in the settings, then just define:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    http_user = 'username'
    http_pass = 'password'
    ...

as class variables in your spider.

Also, you don't need to specify the login credentials in the URL if the middleware has been enabled in the settings.
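For reference, a settings sketch; HttpAuthMiddleware ships with Scrapy and is enabled by default at priority 300, so this is only needed if it was disabled in your project (newer Scrapy versions may also expect an http_auth_domain attribute on the spider to scope the credentials):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
}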

answered Oct 18 '22 by Verbal_Kint