With scrapy shell I can request URLs with GET
parameters and interactively explore the response:
scrapy shell "https://duckduckgo.com/?q=foo"
But on some websites, my request gets a 301
redirect and the URL parameters are stripped:
DEBUG: Redirecting (301) to <GET http://foo.com/mypage/> from <GET http://foo.com/mypage/?bar=baz>
DEBUG: Crawled (200) <GET http://foo.com/mypage/> (referer: None)
When I visit http://foo.com/mypage/?bar=baz
in my browser as normal, I don't get redirected and the GET
parameters remain.
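To confirm that it is the redirect that drops the parameters, the requested URL and the final URL from the log can be compared. This is only a minimal sketch using the standard library; the helper name query_was_stripped is made up for illustration and is not part of Scrapy:

```python
from urllib.parse import urlsplit, parse_qs


def query_was_stripped(requested_url, final_url):
    """Return True if any query parameter of requested_url is absent from final_url.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    requested = parse_qs(urlsplit(requested_url).query)
    final = parse_qs(urlsplit(final_url).query)
    # A parameter counts as stripped if its name no longer appears at all.
    return any(name not in final for name in requested)


# The two URLs from the DEBUG log above.
print(query_was_stripped("http://foo.com/mypage/?bar=baz", "http://foo.com/mypage/"))
```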
Can anyone suggest how I might avoid being redirected?
Inspired by @paultrmbrth's answer in the comments, here's exactly how to get around this problem using User Agent spoofing.
First, find your browser's User Agent string (I did this using http://www.whatsmyuseragent.com/ but there may be other ways).
Mine was
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Now in project_name/settings.py
add the following line:
USER_AGENT = "whatever the user agent string was"
and scrapy shell "http://foo.com/mypage/?bar=baz"
will work as expected.
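For illustration, the settings file might then contain something like the following. This is a sketch that assumes the Firefox string quoted above; substitute whatever your own browser reports:

```python
# project_name/settings.py
# Assumption: this is the example Firefox User-Agent quoted in this answer.
# Replace it with the string your own browser sends.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"
```

Outside a project, the same setting can be passed on the command line with the -s option, e.g. scrapy shell -s USER_AGENT="whatever the user agent string was" "http://foo.com/mypage/?bar=baz"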