How can I set the user agent for Scrapy with Splash, equivalent to the requests example below?
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.example.com"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
My spider would look similar to this:
import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5}
            )
You need to set the user_agent attribute on the spider to override the default user agent:
class ExampleSpider(scrapy.Spider):
    name = 'example'
    user_agent = 'Mozilla/5.0'
In this case UserAgentMiddleware (which is enabled by default) will override the USER_AGENT setting value with 'Mozilla/5.0'.
You can also override headers per request:
scrapy_splash.SplashRequest(url, headers={'User-Agent': custom_user_agent})
The proper way is to alter the Splash script to include it rather than add it to the spider, though setting it in the spider is fine if that works as well.
http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=agent
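A minimal sketch of the script-based approach, assuming Splash's "execute" endpoint and the splash:set_user_agent method from the scripting reference linked above. The names LUA_SET_USER_AGENT and build_splash_args are hypothetical helpers introduced for illustration, not part of scrapy-splash:

```python
# Lua script for Splash's "execute" endpoint. splash:set_user_agent is
# called before splash:go so the page request carries the custom header.
LUA_SET_USER_AGENT = """
function main(splash, args)
    splash:set_user_agent(args.user_agent)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html()}
end
"""


def build_splash_args(user_agent, wait=0.5):
    """Return the `args` dict for a SplashRequest that runs the script above.

    Hypothetical helper, not part of scrapy-splash.
    """
    return {
        "lua_source": LUA_SET_USER_AGENT,
        "user_agent": user_agent,  # read in Lua as args.user_agent
        "wait": wait,              # read in Lua as args.wait
    }
```

In the spider from the question, this would probably be used as SplashRequest(url, self.parse, endpoint='execute', args=build_splash_args('Mozilla/5.0')). The script relies on scrapy-splash filling in args.url from the request URL, so it is not passed explicitly here.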