
Scrapy shell gets 301 redirected to URL without parameters

Scrapy can request URLs with GET parameters to interactively explore the response:

scrapy shell "https://duckduckgo.com/?q=foo"

But with some websites, my request gets 301 redirected and the URL parameters are stripped:

DEBUG: Redirecting (301) to <GET http://foo.com/mypage/> 
  from <GET http://foo.com/mypage/?bar=baz>
DEBUG: Crawled (200) <GET http://foo.com/mypage/> (referer: None)

When I visit http://foo.com/mypage/?bar=baz in my browser, I am not redirected and the GET parameters remain.

Can anyone suggest how I might avoid being redirected?

Raj asked Jun 09 '14 11:06



1 Answer

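Inspired by @paultrmbrth's answer in the comments, here's exactly how to get around this problem using User Agent spoofing. The site seems to treat Scrapy's default User-Agent differently from a browser's, which would explain why the same URL is not redirected in a browser.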

First, find your browser's User Agent string (I did this using http://www.whatsmyuseragent.com/ but there may be other ways).

Mine was

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0

Now in project_name/settings.py (USER_AGENT is a Scrapy setting, so it belongs in the settings module, not items.py) add the following line:

USER_AGENT = "whatever the user agent string was"

and scrapy shell "http://foo.com/mypage/?bar=baz" will work as expected.
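
To confirm that the redirect really depends on the User-Agent before changing any Scrapy settings, a rough check can be made with the Python standard library alone. This is a minimal sketch, not part of Scrapy: the helper names build_request, lost_query_params and final_url_for are hypothetical, and foo.com is only the placeholder host from the question.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request, urlopen

# Example browser User-Agent string, copied from the answer above.
BROWSER_UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"


def build_request(url, user_agent):
    """Return a GET request that carries the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def lost_query_params(requested_url, final_url):
    """Return names of query parameters in requested_url that are missing from final_url."""
    requested = parse_qs(urlsplit(requested_url).query, keep_blank_values=True)
    final = parse_qs(urlsplit(final_url).query, keep_blank_values=True)
    return set(requested) - set(final)


def final_url_for(url, user_agent):
    """Fetch url, following any redirects, and return the URL finally served."""
    with urlopen(build_request(url, user_agent), timeout=10) as response:
        return response.geturl()


# Usage (needs network access; replace the placeholder URL with a real one):
#   url = "http://foo.com/mypage/?bar=baz"
#   landed = final_url_for(url, BROWSER_UA)
#   print(landed, lost_query_params(url, landed))
```

If the parameters survive with the browser string but are lost with a different User-Agent, the setting above is probably the right fix. For a one-off session, Scrapy's -s NAME=VALUE command-line option should also let you pass USER_AGENT to scrapy shell without editing settings.py.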

LondonRob answered Nov 04 '22 14:11
