Scrapy Shell - How to change USER_AGENT

I have a fully functioning scrapy script to extract data from a website. During setup, the target site banned me based on my USER_AGENT information. I subsequently added a RotateUserAgentMiddleware to rotate the USER_AGENT randomly. This works great.

However, now when I try to use the scrapy shell to test XPath and CSS selectors, I get a 403 error. I'm sure this is because the USER_AGENT of the scrapy shell defaults to a value the target site has blacklisted.

Question: is it possible to fetch a URL in the scrapy shell with a different USER_AGENT than the default?

fetch('http://www.test') [add something ?? to change USER_AGENT]

Thx

dfriestedt asked Aug 21 '14 15:08



2 Answers

scrapy shell -s USER_AGENT='custom user agent' 'http://www.example.com'

marven answered Oct 07 '22 23:10


Inside the scrapy shell, you can set the User-Agent in the request header.

url = 'http://www.example.com'
request = scrapy.Request(url, headers={'User-Agent': 'Mybot'})
fetch(request)
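
Since the question mentions rotating the USER_AGENT randomly, the same idea can be carried into the shell. The following is only a sketch under assumptions: the `USER_AGENTS` list and the `random_ua_headers` helper are hypothetical names that are not part of Scrapy, and the User-Agent strings are placeholders to replace with your own.

```python
import random

# Hypothetical pool of User-Agent strings (placeholders, not from the question).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_ua_headers(pool=USER_AGENTS):
    """Return a headers dict with a User-Agent picked at random from pool."""
    return {"User-Agent": random.choice(pool)}


# Inside the scrapy shell, where `scrapy` and `fetch` are already available:
#   request = scrapy.Request('http://www.example.com', headers=random_ua_headers())
#   fetch(request)
```

The helper can be pasted into the shell session as-is; each call to `random_ua_headers()` picks a new value, so repeated fetches may go out with different User-Agent headers.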

salmanwahed answered Oct 08 '22 00:10